AIML Capstone Jan'24 A Capstone Topic - NLP1 Group 3¶

PROBLEM STATEMENT

DOMAIN: Industrial safety. NLP based Chatbot.

CONTEXT: The database comes fromone of the biggest industry in Brazil and in the world. It is an urgent need for industries/companies around the globe to understand why employees still suffer some injuries/accidents in plants. Sometimes they also die in such environment.

DATA DESCRIPTION:This The database is basically records of accidents from12 different plants in 03 different countrieswhich every line in the data is an occurrence of an accident.Columns description: ‣Data: timestamp or time/date information‣Countries: which country the accident occurred (anonymised)‣Local: the city where the manufacturing plant is located (anonymised)‣Industry sector: which sector the plant belongs to‣Accident level: from I to VI, it registers how severe was the accident (I means not severe but VI means very severe)‣Potential Accident Level: Depending on the Accident Level, the database also registers how severe the accident could have been (due to other factors involved in the accident)‣Genre: if the person is male of female‣Employee or Third Party: if the injured person is an employee or a third party‣Critical Risk: some description of the risk involved in the accident‣Description: Detailed description of how the accident happened.

PROJECT OBJECTIVE:Design a ML/DL based chatbot utility which can help the professionals to highlight the safety risk as per the incident description.

Milestone 1

Input: Context and Dataset

1.1 Overview of Dataset

Data: Timestamp or time/date information

Countries : Country of the accident occurrence (anonymized)

Local: City of accident occurence (anonymized)

Industry Sector: Industrial sector of accident occurence

Accident Level: from I to VI, it indicates the severity of the accident

Potential Accident Level : This captures the potential for escalation of the accident

Genre : The gender of the injured party, whether male or female

Employee ou Terceiro : Worker classification if the injured party is an employee or a third party (Contractor)

Risco Critico : Description of the agency and immediate cause of the accident

Description : Detailed description of how the accident occured

Note:

Accident Level (Severity) Classification Since Levels I and IV are provided, we can infer the following;

Level 1 (I): Minor Accident Level 2 (II): Moderate Accident Level 3 (III): Major Accident Level 4 (IV): Serious Accident Level 5 (V): Severe Accident Level 6 (VI): Catastrophic Accident Potential Accident Level (Severity) Classification: We infer the following;

Level 1 (I): Low Potential Level 2 (II): Moderate Potential Level 3 (III): High Potential Level 4 (V): Very High Potential Level 5 (V): Extreme Potential Level 6 (VI): Critical Potential

1.2 Process:

Step 1.2: Import the data

1.2.1 Importing of Libraries

In [ ]:
# Importing and installing the necessary libraries
import pandas as pd
!pip install roman
import roman
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
from sklearn.linear_model import LogisticRegression
import plotly.graph_objects as go
from IPython.display import display
import re
import holoviews as hv
from holoviews import opts
!pip install hvplot
import hvplot.pandas
import random  # Import the random module
import seaborn as sns
# pre-processing methods
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import LabelEncoder


!pip install nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import nltk

# Download NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

from wordcloud import WordCloud
from collections import Counter

!pip install openpyxl

import string
Collecting roman
  Downloading roman-4.2-py3-none-any.whl.metadata (3.6 kB)
Downloading roman-4.2-py3-none-any.whl (5.5 kB)
Installing collected packages: roman
Successfully installed roman-4.2
Collecting hvplot
  Downloading hvplot-0.11.1-py3-none-any.whl.metadata (15 kB)
Requirement already satisfied: bokeh>=3.1 in /usr/local/lib/python3.10/dist-packages (from hvplot) (3.6.2)
Requirement already satisfied: colorcet>=2 in /usr/local/lib/python3.10/dist-packages (from hvplot) (3.1.0)
Requirement already satisfied: holoviews>=1.19.0 in /usr/local/lib/python3.10/dist-packages (from hvplot) (1.20.0)
Requirement already satisfied: numpy>=1.21 in /usr/local/lib/python3.10/dist-packages (from hvplot) (1.26.4)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from hvplot) (24.2)
Requirement already satisfied: pandas>=1.3 in /usr/local/lib/python3.10/dist-packages (from hvplot) (2.2.2)
Requirement already satisfied: panel>=1.0 in /usr/local/lib/python3.10/dist-packages (from hvplot) (1.5.4)
Requirement already satisfied: param<3.0,>=1.12.0 in /usr/local/lib/python3.10/dist-packages (from hvplot) (2.1.1)
Requirement already satisfied: Jinja2>=2.9 in /usr/local/lib/python3.10/dist-packages (from bokeh>=3.1->hvplot) (3.1.4)
Requirement already satisfied: contourpy>=1.2 in /usr/local/lib/python3.10/dist-packages (from bokeh>=3.1->hvplot) (1.3.1)
Requirement already satisfied: pillow>=7.1.0 in /usr/local/lib/python3.10/dist-packages (from bokeh>=3.1->hvplot) (11.0.0)
Requirement already satisfied: PyYAML>=3.10 in /usr/local/lib/python3.10/dist-packages (from bokeh>=3.1->hvplot) (6.0.2)
Requirement already satisfied: tornado>=6.2 in /usr/local/lib/python3.10/dist-packages (from bokeh>=3.1->hvplot) (6.3.3)
Requirement already satisfied: xyzservices>=2021.09.1 in /usr/local/lib/python3.10/dist-packages (from bokeh>=3.1->hvplot) (2024.9.0)
Requirement already satisfied: pyviz-comms>=2.1 in /usr/local/lib/python3.10/dist-packages (from holoviews>=1.19.0->hvplot) (3.0.3)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.3->hvplot) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.3->hvplot) (2024.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.3->hvplot) (2024.2)
Requirement already satisfied: bleach in /usr/local/lib/python3.10/dist-packages (from panel>=1.0->hvplot) (6.2.0)
Requirement already satisfied: linkify-it-py in /usr/local/lib/python3.10/dist-packages (from panel>=1.0->hvplot) (2.0.3)
Requirement already satisfied: markdown in /usr/local/lib/python3.10/dist-packages (from panel>=1.0->hvplot) (3.7)
Requirement already satisfied: markdown-it-py in /usr/local/lib/python3.10/dist-packages (from panel>=1.0->hvplot) (3.0.0)
Requirement already satisfied: mdit-py-plugins in /usr/local/lib/python3.10/dist-packages (from panel>=1.0->hvplot) (0.4.2)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from panel>=1.0->hvplot) (2.32.3)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from panel>=1.0->hvplot) (4.66.6)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from panel>=1.0->hvplot) (4.12.2)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from Jinja2>=2.9->bokeh>=3.1->hvplot) (3.0.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas>=1.3->hvplot) (1.17.0)
Requirement already satisfied: webencodings in /usr/local/lib/python3.10/dist-packages (from bleach->panel>=1.0->hvplot) (0.5.1)
Requirement already satisfied: uc-micro-py in /usr/local/lib/python3.10/dist-packages (from linkify-it-py->panel>=1.0->hvplot) (1.0.3)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py->panel>=1.0->hvplot) (0.1.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->panel>=1.0->hvplot) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->panel>=1.0->hvplot) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->panel>=1.0->hvplot) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->panel>=1.0->hvplot) (2024.8.30)
Downloading hvplot-0.11.1-py3-none-any.whl (161 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 161.2/161.2 kB 6.3 MB/s eta 0:00:00
Installing collected packages: hvplot
Successfully installed hvplot-0.11.1
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.9.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk) (8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk) (1.4.2)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk) (2024.9.11)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk) (4.66.6)
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
Requirement already satisfied: openpyxl in /usr/local/lib/python3.10/dist-packages (3.1.5)
Requirement already satisfied: et-xmlfile in /usr/local/lib/python3.10/dist-packages (from openpyxl) (2.0.0)

1.2.2 Load DataSet

In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
!ls '/content/drive/MyDrive/AIML_Capstone_Project'
'Data Set Industrial_safety_and_health_database_with_accidents_description.xlsx'
 df_preprocess_10122024.csv
 df_preprocess_12082024.csv
 df_preprocess.csv
 df_trials_09122024.csv
 exported_data_NLP_Chatbot_Industry_Accident.xlsx
 Final_NLP_Glove_df.csv
 Final_NLP_Glove_df.xlsx
 Final_NLP_TFIDF_df.csv
 Final_NLP_TFIDF_df.xlsx
 Final_NLP_Word2Vec_df.csv
 Final_NLP_Word2Vec_df.xlsx
 glove.6B
'Interium Project'
 Intermediate_NLP_Glove_df_update.xlsx
 Intermediate_NLP_Glove_df.xlsx
 Intermediate_NLP_TFIDF_df.xlsx
 Intermediate_NLP_Word2Vec_df.xlsx
In [ ]:
import pandas as pd
df = pd.read_excel('/content/drive/MyDrive/AIML_Capstone_Project/Data Set Industrial_safety_and_health_database_with_accidents_description.xlsx')
In [ ]:
# Get the top 5 rows
display(df.head())
Unnamed: 0 Data Countries Local Industry Sector Accident Level Potential Accident Level Genre Employee or Third Party Critical Risk Description
0 0 2016-01-01 Country_01 Local_01 Mining I IV Male Third Party Pressed While removing the drill rod of the Jumbo 08 f...
1 1 2016-01-02 Country_02 Local_02 Mining I IV Male Employee Pressurized Systems During the activation of a sodium sulphide pum...
2 2 2016-01-06 Country_01 Local_03 Mining I III Male Third Party (Remote) Manual Tools In the sub-station MILPO located at level +170...
3 3 2016-01-08 Country_01 Local_04 Mining I I Male Third Party Others Being 9:45 am. approximately in the Nv. 1880 C...
4 4 2016-01-10 Country_01 Local_04 Mining IV IV Male Third Party Others Approximately at 11:45 a.m. in circumstances t...

Shape of the data

In [ ]:
print("Number of rows = {0} and Number of Columns = {1} in the Data frame".format(df.shape[0], df.shape[1]))
Number of rows = 425 and Number of Columns = 11 in the Data frame

Data type of each attribute

In [ ]:
# Check datatypes
df.dtypes
Out[ ]:
0
Unnamed: 0 int64
Data datetime64[ns]
Countries object
Local object
Industry Sector object
Accident Level object
Potential Accident Level object
Genre object
Employee or Third Party object
Critical Risk object
Description object

From the above output, we see that except first column all other columns datatype is object.

Categorical columns - 'Countries', 'Local', 'Industry Sector', 'Accident Level', 'Potential Accident Level', 'Genre', 'Employee or Third Party', 'Critical Risk', 'Description'

Date column - 'Data'

In [ ]:
# Check Data frame info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 425 entries, 0 to 424
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Unnamed: 0                425 non-null    int64         
 1   Data                      425 non-null    datetime64[ns]
 2   Countries                 425 non-null    object        
 3   Local                     425 non-null    object        
 4   Industry Sector           425 non-null    object        
 5   Accident Level            425 non-null    object        
 6   Potential Accident Level  425 non-null    object        
 7   Genre                     425 non-null    object        
 8   Employee or Third Party   425 non-null    object        
 9   Critical Risk             425 non-null    object        
 10  Description               425 non-null    object        
dtypes: datetime64[ns](1), int64(1), object(9)
memory usage: 36.6+ KB
In [ ]:
# Column names of Data frame
df.columns
Out[ ]:
Index(['Unnamed: 0', 'Data', 'Countries', 'Local', 'Industry Sector',
       'Accident Level', 'Potential Accident Level', 'Genre',
       'Employee or Third Party', 'Critical Risk', 'Description'],
      dtype='object')

Step 1 Summary - Data Collection

There are about 425 rows and 11 columns in the dataset. We noticed that except a 'date' column all other columns are categorical columns.

Step 2: Data cleansing

In [ ]:
# Remove 'Unnamed: 0' column from Data frame
df.drop("Unnamed: 0", axis=1, inplace=True)

# Rename 'Data', 'Countries', 'Genre', 'Employee or Third Party' columns in Data frame
df.rename(columns={'Data':'Date','Countries':'Country','Local' : 'City' , 'Genre':'Gender', 'Employee or Third Party':'Employee type'}, inplace=True)

# Get the top 2 rows
df.head(2)
Out[ ]:
Date Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Description
0 2016-01-01 Country_01 Local_01 Mining I IV Male Third Party Pressed While removing the drill rod of the Jumbo 08 f...
1 2016-01-02 Country_02 Local_02 Mining I IV Male Employee Pressurized Systems During the activation of a sodium sulphide pum...
In [ ]:
# Check duplicates in a data frame
df.duplicated().sum()
Out[ ]:
7
In [ ]:
# Delete duplicate rows
df.drop_duplicates(inplace=True)
In [ ]:
# Check the presence of missing values
df.isnull().sum()
Out[ ]:
0
Date 0
Country 0
City 0
Industry Sector 0
Accident Level 0
Potential Accident Level 0
Gender 0
Employee type 0
Critical Risk 0
Description 0

In [ ]:
print("Number of rows = {0} and Number of Columns = {1} in the Data frame after removing the duplicates.".format(df.shape[0], df.shape[1]))
Number of rows = 418 and Number of Columns = 10 in the Data frame after removing the duplicates.

Data Cleansing Summary:

  1. Removed 'Unnamed: 0' column and renamed - 'Data', 'Countries', 'Genre', 'Employee or Third Party' columns in the dataset.
  2. We had 7 duplicate instances in the dataset and dropped those duplicates.
  3. No missing values in dataset.
  4. We are left with 418 rows and 10 columns after data cleansing.

Step 3: Data preprocessing

In [ ]:
# Convert Accident level and Potential Accident Levels from Roman numerals to Numbers

df["Accident Level"] = df["Accident Level"].apply(roman.fromRoman)
df["Potential Accident Level"] = df["Potential Accident Level"].apply(roman.fromRoman)
print(df.head())
        Date     Country      City Industry Sector  Accident Level  \
0 2016-01-01  Country_01  Local_01          Mining               1   
1 2016-01-02  Country_02  Local_02          Mining               1   
2 2016-01-06  Country_01  Local_03          Mining               1   
3 2016-01-08  Country_01  Local_04          Mining               1   
4 2016-01-10  Country_01  Local_04          Mining               4   

   Potential Accident Level Gender         Employee type        Critical Risk  \
0                         4   Male           Third Party              Pressed   
1                         4   Male              Employee  Pressurized Systems   
2                         3   Male  Third Party (Remote)         Manual Tools   
3                         1   Male           Third Party               Others   
4                         4   Male           Third Party               Others   

                                         Description  
0  While removing the drill rod of the Jumbo 08 f...  
1  During the activation of a sodium sulphide pum...  
2  In the sub-station MILPO located at level +170...  
3  Being 9:45 am. approximately in the Nv. 1880 C...  
4  Approximately at 11:45 a.m. in circumstances t...  
In [ ]:
# Convert the columns to the correct data types
df["Date"] = pd.to_datetime(df["Date"])
df["City"] = df["City"].astype("category")
df["Country"] = df["Country"].astype("category")
df["Accident Level"] = df["Accident Level"].astype("category")
df["Potential Accident Level"] = df["Potential Accident Level"].astype("category")
df["Gender"] = df["Gender"].astype("category")
df["Critical Risk"] = df["Critical Risk"].astype("category")
df["Employee type"] = df["Employee type"].astype("category")

# Replaces the value'\nNot applicable' with 'Not applicable' in the 'Critical Risk' column.
df["Critical Risk"] = df["Critical Risk"].replace("\nNot applicable", "Not applicable")

# Print the first few rows of the DataFrame
print(df.head())
        Date     Country      City Industry Sector Accident Level  \
0 2016-01-01  Country_01  Local_01          Mining              1   
1 2016-01-02  Country_02  Local_02          Mining              1   
2 2016-01-06  Country_01  Local_03          Mining              1   
3 2016-01-08  Country_01  Local_04          Mining              1   
4 2016-01-10  Country_01  Local_04          Mining              4   

  Potential Accident Level Gender         Employee type        Critical Risk  \
0                        4   Male           Third Party              Pressed   
1                        4   Male              Employee  Pressurized Systems   
2                        3   Male  Third Party (Remote)         Manual Tools   
3                        1   Male           Third Party               Others   
4                        4   Male           Third Party               Others   

                                         Description  
0  While removing the drill rod of the Jumbo 08 f...  
1  During the activation of a sodium sulphide pum...  
2  In the sub-station MILPO located at level +170...  
3  Being 9:45 am. approximately in the Nv. 1880 C...  
4  Approximately at 11:45 a.m. in circumstances t...  
<ipython-input-16-46bc1fa7e3f6>:12: FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.
  df["Critical Risk"] = df["Critical Risk"].replace("\nNot applicable", "Not applicable")
In [ ]:
# Replaces all instances of the value 'Third Party' in the 'Employee or Contractor' column with 'Contractor'.
df["Employee type"] = df["Employee type"].replace("Third Party", "Contractor")
df["Employee type"] = df["Employee type"].replace("Third Party (Remote)", "Contractor (Remote)")

# Print the first few rows of the DataFrame

# Convert numeric values to dates
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
<ipython-input-17-ca41794a0d26>:2: FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.
  df["Employee type"] = df["Employee type"].replace("Third Party", "Contractor")
<ipython-input-17-ca41794a0d26>:3: FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.
  df["Employee type"] = df["Employee type"].replace("Third Party (Remote)", "Contractor (Remote)")

To better understand the data, I am extracting the day, month and year from Date column and creating new features such as weekday, weekofyear.

In [ ]:
df['Date'] = pd.to_datetime(df['Date'])

df['Year'] = df.Date.apply(lambda x : x.year)
df['Month'] = df.Date.apply(lambda x : x.month)
df['Day'] = df.Date.apply(lambda x : x.day)
df['Weekday'] = df.Date.apply(lambda x : x.day_name())
df['WeekofYear'] =df.Date.apply(lambda x : x.weekofyear)

df.head()
Out[ ]:
Date Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Description Year Month Day Weekday WeekofYear
0 2016-01-01 Country_01 Local_01 Mining 1 4 Male Contractor Pressed While removing the drill rod of the Jumbo 08 f... 2016 1 1 Friday 53
1 2016-01-02 Country_02 Local_02 Mining 1 4 Male Employee Pressurized Systems During the activation of a sodium sulphide pum... 2016 1 2 Saturday 53
2 2016-01-06 Country_01 Local_03 Mining 1 3 Male Contractor (Remote) Manual Tools In the sub-station MILPO located at level +170... 2016 1 6 Wednesday 1
3 2016-01-08 Country_01 Local_04 Mining 1 1 Male Contractor Others Being 9:45 am. approximately in the Nv. 1880 C... 2016 1 8 Friday 1
4 2016-01-10 Country_01 Local_04 Mining 4 4 Male Contractor Others Approximately at 11:45 a.m. in circumstances t... 2016 1 10 Sunday 1

Step3.1 Statistical Analysis

The next step was to perform a statistical analysis of the data. This involved generating a report of the statistical analysis, which includes the following information:

  • The descriptive statistics for the numerical columns

FrequencyDistribution ¶

  • This section we would determine the Frequency Distribution for our dataset columns. The frequency distribution shows how many times each value in the column appears.
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 418 entries, 0 to 424
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Date                      418 non-null    datetime64[ns]
 1   Country                   418 non-null    category      
 2   City                      418 non-null    category      
 3   Industry Sector           418 non-null    object        
 4   Accident Level            418 non-null    category      
 5   Potential Accident Level  418 non-null    category      
 6   Gender                    418 non-null    category      
 7   Employee type             418 non-null    category      
 8   Critical Risk             418 non-null    category      
 9   Description               418 non-null    object        
 10  Year                      418 non-null    int64         
 11  Month                     418 non-null    int64         
 12  Day                       418 non-null    int64         
 13  Weekday                   418 non-null    object        
 14  WeekofYear                418 non-null    int64         
dtypes: category(7), datetime64[ns](1), int64(4), object(3)
memory usage: 34.7+ KB
In [ ]:
# Calculate the frequency distribution for the categorical columns
for column in df.select_dtypes(include=["object", "category"]):
    print(column, df[column].value_counts())
Country Country
Country_01    248
Country_02    129
Country_03     41
Name: count, dtype: int64
City City
Local_03    89
Local_05    59
Local_01    56
Local_04    55
Local_06    46
Local_10    41
Local_08    27
Local_02    23
Local_07    14
Local_12     4
Local_09     2
Local_11     2
Name: count, dtype: int64
Industry Sector Industry Sector
Mining    237
Metals    134
Others     47
Name: count, dtype: int64
Accident Level Accident Level
1    309
2     40
3     31
4     30
5      8
Name: count, dtype: int64
Potential Accident Level Potential Accident Level
4    141
3    106
2     95
1     45
5     30
6      1
Name: count, dtype: int64
Gender Gender
Male      396
Female     22
Name: count, dtype: int64
Employee type Employee type
Contractor             185
Employee               178
Contractor (Remote)     55
Name: count, dtype: int64
Critical Risk Critical Risk
Others                                       229
Pressed                                       24
Manual Tools                                  20
Chemical substances                           17
Cut                                           14
Venomous Animals                              13
Projection                                    13
Bees                                          10
Fall                                           9
Vehicles and Mobile Equipment                  8
Fall prevention (same level)                   7
remains of choco                               7
Pressurized Systems                            7
Fall prevention                                6
Suspended Loads                                6
Pressurized Systems / Chemical Substances      3
Blocking and isolation of energies             3
Liquid Metal                                   3
Power lock                                     3
Electrical Shock                               2
Machine Protection                             2
Not applicable                                 1
Burn                                           1
Confined space                                 1
Electrical installation                        1
Individual protection equipment                1
Projection of fragments                        1
Poll                                           1
Plates                                         1
Projection/Manual Tools                        1
Projection/Choco                               1
Projection/Burning                             1
Traffic                                        1
Name: count, dtype: int64
Description Description
During the activity of chuteo of ore in hopper OP5; the operator of the locomotive parks his equipment under the hopper to fill the first car, it is at this moment that when it was blowing out to release the load, a mud flow suddenly appears with the presence of rock fragments; the personnel that was in the direction of the flow was covered with mud.                                                                                                                                                                                                                                                                                                                                                                                      2
The employees Márcio and Sérgio performed the pump pipe clearing activity FZ1.031.4 and during the removal of the suction spool flange bolts, there was projection of pulp over them causing injuries.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                2
In the geological reconnaissance activity, in the farm of Mr. Lázaro, the team composed by Felipe and Divino de Morais, in normal activity encountered a ciliary forest, as they needed to enter the forest to verify a rock outcrop which was the front, the Divine realized the opening of the access with machete. At that moment, took a bite from his neck. There were no more attacks, no allergic reaction, and continued work normally. With the work completed, leaving the forest for the same access, the Divine assistant was attacked by a snake and suffered a sting in the forehead. At that moment they moved away from the area. It was verified that there was no type of allergic reaction and returned with normal activities.    2
At moments when the MAPERU truck of plate F1T 878, returned from the city of Pasco to the Unit transporting a consultant, being 350 meters from the main gate his lane is invaded by a civilian vehicle, making the driver turn sharply to the side right where was staff of the company IMPROMEC doing hot melt work in an 8 "pipe impacting two collaborators causing the injuries described At the time of the accident the truck was traveling at 37km / h - according to INTHINC -, the width of the road is of 6 meters, the activity had safety cones as a warning on both sides of the road and employees used their respective EPP'S.                                                                                                        2
When starting the activity of removing a coil of electric cables in the warehouse with the help of forklift truck the operator did not notice that there was a beehive in it. Due to the movement of the coil the bees were excited. Realizing the fact the operator turned off the equipment and left the area. People passing by were stung.                                                                                                                                                                                                                                                                                                                                                                                                        2
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     ..
Being 01:50 p.m. approximately, in the Nv. 1800, in the Tecnomin winery. Mr. Chagua - Bodeguero was alone, cutting wires No. 16 with a grinder, previously he had removed the protection guard from the disk of 4 inches in diameter and adapted a disk of a crosscutter of approximately 8 inches. Originating traumatic amputation of two fingers of the left hand                                                                                                                                                                                                                                                                                                                                                                                  1
In circumstances that the collaborator performed the cleaning of the ditch 3570, 0.50 cm deep, removing the pipe of 2 "HDPE material with an estimated weight of 30 Kg. Together with two collaborators, when pushing the tube to drain the dune, the collaborator is hit on the lower right side lip producing a slight blow to the lip. At the time of the event, the collaborator had a safety helmet, glasses and gloves.                                                                                                                                                                                                                                                                                                                         1
During the process of washing the material (Becker), the tip of the material was broken which caused a cut of the 5th finger of the right hand                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        1
The clerk was peeling and pulling a sheet came another one that struck in his 5th chirodactile of the left hand tearing his PVC sleeve caused a cut.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  1
Once the mooring of the faneles in the detonating cord has been completed, the injured person proceeds to tie the detonating cord in the safety guide (slow wick) at a distance of 2.0 meters from the top of the work. At that moment, to finish mooring, a rock bank (30cm x 50cm x 15cm; 67.5 Kg.) the same front, from a height of 1.60 meters, which falls to the floor very close to the injured, disintegrates in several fragments, one of which (12cmx10cmx3cm, 2.0 Kg.) slides between the fragments of rock and impacts with the left leg of the victim. At the time of the accident the operator used his safety boots and was accompanied by a supervisor.                                                                               1
Name: count, Length: 411, dtype: int64
Weekday Weekday
Thursday     76
Tuesday      69
Wednesday    62
Friday       61
Saturday     56
Monday       53
Sunday       41
Name: count, dtype: int64

Analysis

The following table summarizes the results of the accident analysis above:

  • Country with the most reported accident cases Country_01 248
  • Location with the most reported accident cases Local_03 89
  • Industry Sector with the most reported accident cases Mining 237
  • Accident Level with the most reported cases Minor 309
  • Potential Accident Level with the most reported cases 4, Very High Potential 141
  • Gender with the most reported cases Male 396
  • Employment status with the most reported cases Contractor 185
  • Critical Risk with the most reported accident cases Others 229

Step 3.1.1 Descriptive Analysis Report

Univariate Anlaysis

Pie Chart: Gender Distribution

In [ ]:
print('--'*30); print('Value Counts for `Gender` label'); print('--'*30)
# Total row count in the dataset
total_row_cnt = len(df)
Male_cnt = df[df['Gender'] == 'Male'].shape[0]
Female_cnt = df[df['Gender'] == 'Female'].shape[0]

print(f'Male count: {Male_cnt} i.e. {round(Male_cnt/total_row_cnt*100, 0)}%')
print(f'Female count: {Female_cnt} i.e. {round(Female_cnt/total_row_cnt*100, 0)}%')

print('--'*30); print('Distributon of `Gender` label'); print('--'*30)

gender_cnt = np.round(df['Gender'].value_counts(normalize=True) * 100)

hv.Bars(gender_cnt).opts(title="Gender Count", color="#98FB98", xlabel="Gender", ylabel="Percentage", yformatter='%d%%')\
                .opts(opts.Bars(width=500, height=300,tools=['hover'],show_grid=True))
------------------------------------------------------------
Value Counts for `Gender` label
------------------------------------------------------------
Male count: 396 i.e. 95.0%
Female count: 22 i.e. 5.0%
------------------------------------------------------------
Distributon of `Gender` label
------------------------------------------------------------
Out[ ]:

Bar Chart: Accident Distribution by Country

In [ ]:
# Plot the distribution of Accidents by Country
country = df["Country"].value_counts()
# Increase the size of the chart
plt.figure(figsize=(4, 8))
plt.bar(country.index, country.values)
plt.title("Distribution of Accidents by Country")
plt.show()

Pie Chart: Accident Distribution by Industry Sector

In [ ]:
# Plot the distribution of Accidents by Industry Sector
industry_sectors = df["Industry Sector"].value_counts()
# Increase the size of the chart
plt.figure(figsize=(12, 8))
# Convert the data to percentages
percentages = 100 * industry_sectors / industry_sectors.sum()
# Create a pie chart
plt.pie(percentages, labels=industry_sectors.index, autopct="%.1f%%")
plt.title("Distribution of Accidents by Industry")
print()
print()
plt.show()

Bar Chart: Distribution of Accidents by City

In [ ]:
# @title City Distribution

# Calculate counts and percentages
counts = df.groupby('City').size().sort_values(ascending=True)
total = counts.sum()
percentages = (counts / total * 100).round(2)

# Create bar plot
ax = counts.plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right']].set_visible(False)

# Add count and percentage labels to bars
for i, (count, percentage) in enumerate(zip(counts, percentages)):
  ax.text(count + 5, i, f'{count} ({percentage}%)', va='center')

plt.xlabel('Count')
plt.show()
<ipython-input-24-71978880bc2d>:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  counts = df.groupby('City').size().sort_values(ascending=True)

Employee Type Distribution

In [ ]:
print('--'*30); print('Value Counts for `Employee type` label'); print('--'*30)

third_party_cnt = df[df['Employee type'] == 'Third Party'].shape[0]
emp_cnt = df[df['Employee type'] == 'Employee'].shape[0]
third_rem_cnt = df[df['Employee type'] == 'Third Party (Remote)'].shape[0]

print(f'Third Party count: {third_party_cnt} i.e. {round(third_party_cnt/total_row_cnt*100, 0)}%')
print(f'Employee count: {emp_cnt} i.e. {round(emp_cnt/total_row_cnt*100, 0)}%')
print(f'Third Party (Remote) count: {third_rem_cnt} i.e. {round(third_rem_cnt/total_row_cnt*100, 0)}%')

print('--'*30); print('Distributon of `Employee type` label'); print('--'*30)

emp_type_cnt = np.round(df['Employee type'].value_counts(normalize=True) * 100)

hv.Bars(emp_type_cnt).opts(title="Employee type Count", color="#228B22", xlabel="Employee Type", ylabel="Percentage", yformatter='%d%%')\
                .opts(opts.Bars(width=500, height=300,tools=['hover'],show_grid=True))
------------------------------------------------------------
Value Counts for `Employee type` label
------------------------------------------------------------
Third Party count: 0 i.e. 0.0%
Employee count: 178 i.e. 43.0%
Third Party (Remote) count: 0 i.e. 0.0%
------------------------------------------------------------
Distributon of `Employee type` label
------------------------------------------------------------
Out[ ]:
In [ ]:
# @title Critical Risk Distribution

# Calculate counts and percentages
counts = df.groupby('Critical Risk').size().sort_values(ascending=True)
total = counts.sum()
percentages = (counts / total * 100).round(2)

# Create bar plot
plt.figure(figsize=(10, 10))  # Adjust figure size as needed
ax = counts.plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right']].set_visible(False)

# Add count and percentage labels to bars
for i, (count, percentage) in enumerate(zip(counts, percentages)):
  ax.text(count + 5, i, f'{count} ({percentage}%)', va='center')

plt.xlabel('Count')
plt.title('Critical Risk Distribution')
plt.show()
<ipython-input-26-f81c8271a58a>:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  counts = df.groupby('Critical Risk').size().sort_values(ascending=True)

Bivariate Analaysis

Accident Levels

In [ ]:
print('--'*30); print('Value Counts for `Accident Level` label'); print('--'*40)
total_row_cnt = df.shape[0]
# Convert 'Accident Level' and 'Potential Accident Level' columns to strings, replacing NaN with empty strings
# Convert 'Accident Level' and 'Potential Accident Level' columns to strings directly
df['Accident Level'] = df['Accident Level'].astype(str).str.strip()
df['Potential Accident Level'] = df['Potential Accident Level'].astype(str).str.strip()

Level_1_acc_cnt = df[df['Accident Level'] == '1'].shape[0]
Level_2_acc_cnt = df[df['Accident Level'] == '2'].shape[0]
Level_3_acc_cnt = df[df['Accident Level'] == '3'].shape[0]
Level_4_acc_cnt = df[df['Accident Level'] == '4'].shape[0]
Level_5_acc_cnt = df[df['Accident Level'] == '5'].shape[0]
Level_6_acc_cnt = df[df['Accident Level'] == '6'].shape[0]

print(f'Accident Level - 1 count: {Level_1_acc_cnt} i.e. {round(Level_1_acc_cnt/total_row_cnt*100, 0)}%')
print(f'Accident Level - 2 count: {Level_2_acc_cnt} i.e. {round(Level_2_acc_cnt/total_row_cnt*100, 0)}%')
print(f'Accident Level - 3 count: {Level_3_acc_cnt} i.e. {round(Level_3_acc_cnt/total_row_cnt*100, 0)}%')
print(f'Accident Level - 4 count: {Level_4_acc_cnt} i.e. {round(Level_4_acc_cnt/total_row_cnt*100, 0)}%')
print(f'Accident Level - 5 count: {Level_5_acc_cnt} i.e. {round(Level_5_acc_cnt/total_row_cnt*100, 0)}%')
print(f'Accident Level - 6 count: {Level_6_acc_cnt} i.e. {round(Level_6_acc_cnt/total_row_cnt*100, 0)}%')

print('--'*30); print('Value Counts for `Potential Accident Level'); print('--'*40)

Level_1_pot_acc_cnt = df[df['Potential Accident Level'] == '1'].shape[0]
Level_2_pot_acc_cnt = df[df['Potential Accident Level'] == '2'].shape[0]
Level_3_pot_acc_cnt = df[df['Potential Accident Level'] == '3'].shape[0]
Level_4_pot_acc_cnt = df[df['Potential Accident Level'] == '4'].shape[0]
Level_5_pot_acc_cnt = df[df['Potential Accident Level'] == '5'].shape[0]
Level_6_pot_acc_cnt = df[df['Potential Accident Level'] == '6'].shape[0]

print(f'Potential Accident Level - 1 count: {Level_1_pot_acc_cnt} i.e. {round(Level_1_pot_acc_cnt/total_row_cnt*100, 0)}%')
print(f'Potential Accident Level - 2 count: {Level_2_pot_acc_cnt} i.e. {round(Level_2_pot_acc_cnt/total_row_cnt*100, 0)}%')
print(f'Potential Accident Level - 3 count: {Level_3_pot_acc_cnt} i.e. {round(Level_3_pot_acc_cnt/total_row_cnt*100, 0)}%')
print(f'Potential Accident Level - 4 count: {Level_4_pot_acc_cnt} i.e. {round(Level_4_pot_acc_cnt/total_row_cnt*100, 0)}%')
print(f'Potential Accident Level - 5 count: {Level_5_pot_acc_cnt} i.e. {round(Level_5_pot_acc_cnt/total_row_cnt*100, 0)}%')
print(f'Potential Accident Level - 6 count: {Level_6_pot_acc_cnt} i.e. {round(Level_6_pot_acc_cnt/total_row_cnt*100, 0)}%')

print('--'*30); print('Distributon of `Accident Level` & `Potential Accident Level` label'); print('--'*40)
# Ensure 'Accident Level' and 'Potential Accident Level' columns are strings
df['Accident Level'] = df['Accident Level'].astype(str).str.strip()
df['Potential Accident Level'] = df['Potential Accident Level'].astype(str).str.strip()

# Calculate percentage distributions for each level
ac_level_cnt = np.round(df['Accident Level'].value_counts(normalize=True) * 100, 1)
pot_ac_level_cnt = np.round(df['Potential Accident Level'].value_counts(normalize=True) * 100, 1)

# Combine into a DataFrame and rename columns
ac_pot = pd.DataFrame({'Accident': ac_level_cnt, 'Potential': pot_ac_level_cnt}).fillna(0)

# Reset index and melt the DataFrame for plotting
ac_pot = ac_pot.reset_index().melt(id_vars='index', value_vars=['Accident', 'Potential'])
ac_pot.columns = ['Severity', 'Level', 'Percentage']


# Updated bar plot code with a green color palette
palette = ["#98FB98", "#3CB371", "#228B22", "#006400"]
plt.figure(figsize=(10, 6))
ax = sns.barplot(x='Severity', y='Percentage', hue='Level', data=ac_pot, palette=palette)

# Add labels to each bar
for container in ax.containers:
    ax.bar_label(container, fmt='%.1f%%', label_type='edge', padding=3)

plt.title('Distribution of Accident Level & Potential Accident Level')
plt.xlabel('Severity')
plt.ylabel('Percentage')
plt.legend(title='Level')
plt.show()
------------------------------------------------------------
Value Counts for `Accident Level` label
--------------------------------------------------------------------------------
Accident Level - 1 count: 309 i.e. 74.0%
Accident Level - 2 count: 40 i.e. 10.0%
Accident Level - 3 count: 31 i.e. 7.0%
Accident Level - 4 count: 30 i.e. 7.0%
Accident Level - 5 count: 8 i.e. 2.0%
Accident Level - 6 count: 0 i.e. 0.0%
------------------------------------------------------------
Value Counts for `Potential Accident Level
--------------------------------------------------------------------------------
Potential Accident Level - 1 count: 45 i.e. 11.0%
Potential Accident Level - 2 count: 95 i.e. 23.0%
Potential Accident Level - 3 count: 106 i.e. 25.0%
Potential Accident Level - 4 count: 141 i.e. 34.0%
Potential Accident Level - 5 count: 30 i.e. 7.0%
Potential Accident Level - 6 count: 1 i.e. 0.0%
------------------------------------------------------------
Distributon of `Accident Level` & `Potential Accident Level` label
--------------------------------------------------------------------------------
<ipython-input-27-a4e5193cf1ab>:58: UserWarning: The palette list has more values (4) than needed (2), which may not be intended.
  ax = sns.barplot(x='Severity', y='Percentage', hue='Level', data=ac_pot, palette=palette)
In [ ]:
# @title Accident Level and Potential Accident Level vs Gender

import matplotlib.pyplot as plt
import seaborn as sns

# Create a figure and axes
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Plot Accident Level vs Gender
sns.countplot(x='Accident Level', hue='Gender', data=df, ax=axes[0], palette='Set2')
axes[0].set_title('Accident Level vs Gender')

# Plot Potential Accident Level vs Gender
sns.countplot(x='Potential Accident Level', hue='Gender', data=df, ax=axes[1], palette='Set2')
axes[1].set_title('Potential Accident Level vs Gender')

# Rotate x-axis labels for better readability
plt.setp(axes[0].get_xticklabels(), rotation=0)
plt.setp(axes[1].get_xticklabels(), rotation=0)

# Adjust layout and display the plot
plt.tight_layout()
plt.show()

Observations: Accident Level vs Gender:

A significantly higher number of males are involved in accidents across all accident levels. The disparity is particularly pronounced in lower accident levels (I and II). Potential Accident Level vs Gender:

Similar to the actual accident level, males are more likely to be involved in potential accidents. The difference in potential accident levels between genders is less pronounced compared to actual accidents, suggesting that preventive measures might be more effective for males.

Employee Type Vs Accident Level Distribution

In [ ]:
# Create a figure and axes
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Plot Accident Level vs Gender
sns.countplot(x='Accident Level', hue='Employee type', data=df, ax=axes[0], palette='Set2')
axes[0].set_title('Accident Level vs Employee Type')

# Plot Potential Accident Level vs Gender
sns.countplot(x='Potential Accident Level', hue='Employee type', data=df, ax=axes[1], palette='Set2')
axes[1].set_title('Potential Accident Level vs Employee Type')

# Rotate x-axis labels for better readability
plt.setp(axes[0].get_xticklabels(), rotation=0)
plt.setp(axes[1].get_xticklabels(), rotation=0)

# Adjust layout and display the plot
plt.tight_layout()
plt.show()

Observations: Accident Level vs Employee Type:

Employees are involved in a significantly higher number of accidents across all accident levels compared to third parties. The difference is particularly pronounced in lower accident levels (I and II). Potential Accident Level vs Employee Type:

Similar to the actual accident level, employees are more likely to be involved in potential accidents compared to third parties. The difference in potential accident levels between employee types is less pronounced compared to actual accidents. This suggests that preventive measures might be more effective for employees.

Distribution of Accidenets by Year and Month

In [ ]:
# @title Accident Level and Potential Accident Over Years and Months

# Extract year and month from the 'Date' column
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

# Plot Accident Level and Potential Accident Level against Year
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
sns.countplot(x='Year', hue='Accident Level', data=df, ax=axes[0], palette='Set2')
axes[0].set_title('Accident Level vs Year')
sns.countplot(x='Year', hue='Potential Accident Level', data=df , ax=axes[1], palette='Set2')
axes[1].set_title('Potential Accident Level vs Year')
plt.tight_layout()
plt.show()

# Plot Accident Level and Potential Accident Level against Month
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
sns.countplot(x='Month', hue='Accident Level', data=df, ax=axes[0], palette='Set2')
axes[0].set_title('Accident Level vs Month')
sns.countplot(x='Month', hue='Potential Accident Level', data=df, ax=axes[1], palette='Set2')
axes[1].set_title('Potential Accident Level vs Month')
plt.tight_layout()
plt.show()

Observations: Accident Level vs Year:

There's a noticeable decrease in the number of accidents across all levels in the later years compared to the initial years. This suggests a positive trend in terms of safety improvements over time. Potential Accident Level vs Year:

Similar to the actual accident level, potential accidents also show a decreasing trend over the years. This indicates that preventive measures and safety protocols might be becoming more effective in mitigating potential risks. Accident Level vs Month:

There's some variation in accident counts across different months, but no clear seasonal pattern emerges. Further analysis might be needed to identify potential factors influencing these monthly fluctuations. Potential Accident Level vs Month:

Similar to the actual accident level, potential accidents also show some monthly variation without a distinct seasonal pattern. This suggests that the factors influencing accident occurrences might not be strongly tied to specific months.

In [ ]:
# @title Monthly Frequency of Accidents Over Years

# Group by year and month and count accidents
monthly_accidents = df.groupby(['Year', 'Month'])['Date'].count().reset_index(name='Accident Count')

# Pivot the table for plotting
monthly_accidents_pivot = monthly_accidents.pivot(index='Month', columns='Year', values='Accident Count')

# Plot the monthly accident frequency for each year
plt.figure(figsize=(10, 6))
monthly_accidents_pivot.plot(kind='line', marker='o')
plt.title('Monthly Frequency of Accidents Over Years', fontsize=12)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Number of Accidents', fontsize=12)
plt.xticks(range(1, 13))  # Set x-axis ticks to represent months
plt.legend(title='Year', loc='upper right')
plt.grid(False, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
<ipython-input-31-0b5acb1dc3e9>:17: UserWarning: First parameter to grid() is false, but line properties are supplied. The grid will be enabled.
  plt.grid(False, linestyle='--', alpha=0.7)
<Figure size 1000x600 with 0 Axes>

Observations: Overall Trend:

There appears to be a general downward trend in the number of accidents over the years. This could suggest that safety measures or interventions implemented over time are having a positive impact. Seasonal Variations:

There might be some seasonal variations in accident frequency. For example, there seems to be a slight increase in accidents around the middle of the year (months 5-7) in some years. This could be related to factors like weather conditions, workload, or specific activities happening during those months. Year-to-Year Fluctuations:

While the overall trend is downward, there are fluctuations in accident counts from year to year. This highlights the need for continuous monitoring and adjustment of safety protocols to address specific challenges that might arise in different periods. Further Analysis:

To gain deeper insights, it would be helpful to analyze the specific causes of accidents in different months and years. This could reveal patterns or contributing factors that can be targeted for further improvement.

In [ ]:
# Define the custom order for weekdays
weekday_order = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']

# Convert Weekday to a categorical type with the custom order
df['Weekday'] = pd.Categorical(df['Weekday'], categories=weekday_order, ordered=True)

# Plot distributions
fig, ax = plt.subplots(1, 5, figsize=(20, 10))

for i, col in enumerate(['Year', 'Month', 'Day', 'Weekday', 'WeekofYear']):
    sns.countplot(y=df[col].astype('category'), ax=ax[i], order=df[col].cat.categories if col == 'Weekday' else None)

plt.tight_layout()
plt.show()
In [ ]:
# @title Date vs Potential Accident Level count()

from matplotlib import pyplot as plt
import seaborn as sns
def _plot_series(series, series_name, series_index=0):
  palette = list(sns.palettes.mpl_palette('Dark2'))
  counted = (series['Date']
                .value_counts()
              .reset_index(name='counts')
              .rename({'index': 'Date'}, axis=1)
              .sort_values('Date', ascending=True))
  xs = counted['Date']
  ys = counted['counts']
  plt.plot(xs, ys, label=series_name, color=palette[series_index % len(palette)])

fig, ax = plt.subplots(figsize=(15, 5), layout='constrained')
df_sorted = df.sort_values('Date', ascending=True)
for i, (series_name, series) in enumerate(df_sorted.groupby('Potential Accident Level')):
  _plot_series(series, series_name, i)
  fig.legend(title='Potential Accident Level', bbox_to_anchor=(1, 1), loc='upper left')
sns.despine(fig=fig, ax=ax)
plt.xlabel('Date')
_ = plt.ylabel('count()')

Observations: Trend Over Time:

There is no clear long-term increasing or decreasing trend in the number of accidents for any potential accident level. The counts fluctuate over time, indicating potential seasonality or other factors influencing accident occurrences. Potential Accident Level IV:

It consistently shows a lower number of accidents compared to other levels. This suggests that accidents with a high potential severity (level IV) are relatively less frequent. Fluctuations and Peaks:

There are noticeable fluctuations in the counts for all potential accident levels. Some periods show peaks in accident occurrences, which might be related to specific events, seasonal changes, or other external factors. No Clear Pattern:

There is no consistent pattern in the relationship between the date and the number of accidents for any potential accident level. This suggests that the occurrence of accidents might be influenced by multiple factors that interact in complex ways.

In [ ]:
# @title Date vs Accident Level count()

from matplotlib import pyplot as plt
import seaborn as sns
def _plot_series(series, series_name, series_index=0):
  palette = list(sns.palettes.mpl_palette('Dark2'))
  counted = (series['Date']
                .value_counts()
              .reset_index(name='counts')
              .rename({'index': 'Date'}, axis=1)
              .sort_values('Date', ascending=True))
  xs = counted['Date']
  ys = counted['counts']
  plt.plot(xs, ys, label=series_name, color=palette[series_index % len(palette)])

fig, ax = plt.subplots(figsize=(15, 5), layout='constrained')
df_sorted = df.sort_values('Date', ascending=True)
for i, (series_name, series) in enumerate(df_sorted.groupby('Accident Level')):
  _plot_series(series, series_name, i)
  fig.legend(title='Accident Level', bbox_to_anchor=(1, 1), loc='upper left')
sns.despine(fig=fig, ax=ax)
plt.xlabel('Date')
_ = plt.ylabel('count()')

Observations: Trend Over Time:

There is no clear long-term increasing or decreasing trend in the number of accidents for any accident level. The counts fluctuate over time, indicating potential seasonality or other factors influencing accident occurrences. Accident Levels I and II:

These levels consistently show a higher number of accidents compared to other levels. This suggests that minor accidents (levels I and II) are more frequent. Fluctuations and Peaks:

There are noticeable fluctuations in the counts for all accident levels. Some periods show peaks in accident occurrences, which might be related to specific events, seasonal changes, or other external factors. No Clear Pattern:

There is no consistent pattern in the relationship between the date and the number of accidents for any accident level. This suggests that the occurrence of accidents might be influenced by multiple factors that interact in complex ways.

In [ ]:
# Countplot

# Custom Spectral palette with enough colors for all months
unique_months = df['Month'].nunique()
palette = sns.color_palette("Spectral", unique_months)

sns.countplot(data=df, x='Accident Level', hue='Month', palette=palette)
plt.legend(title='Month', bbox_to_anchor=(1.05, 1), loc='upper left')  # Adjust legend position
plt.show()
In [ ]:
# @title Date vs Industry Sector count()

from matplotlib import pyplot as plt
import seaborn as sns
def _plot_series(series, series_name, series_index=0):
  palette = list(sns.palettes.mpl_palette('Dark2'))
  counted = (series['Date']
                .value_counts()
              .reset_index(name='counts')
              .rename({'index': 'Date'}, axis=1)
              .sort_values('Date', ascending=True))
  xs = counted['Date']
  ys = counted['counts']
  plt.plot(xs, ys, label=series_name, color=palette[series_index % len(palette)])

fig, ax = plt.subplots(figsize=(15, 5), layout='constrained')
df_sorted = df.sort_values('Date', ascending=True)
for i, (series_name, series) in enumerate(df_sorted.groupby('Industry Sector')):
  _plot_series(series, series_name, i)
  fig.legend(title='Industry Sector', bbox_to_anchor=(1, 1), loc='upper left')
sns.despine(fig=fig, ax=ax)
plt.xlabel('Date')
_ = plt.ylabel('count()')

Observations: Mining Sector:

The Mining sector consistently shows a higher number of accidents compared to other sectors throughout the observed period. This indicates that the Mining industry faces a greater risk of accidents compared to other sectors. Fluctuations and Peaks:

All sectors experience fluctuations in the number of accidents over time. Some periods show peaks in accident occurrences, suggesting potential seasonal variations or other external factors influencing accident rates. Other Sectors:

Sectors like Metals, Others, and Chemicals show relatively lower but still significant numbers of accidents. The fluctuations in these sectors also suggest the influence of external factors on accident occurrences. No Clear Trend:

There is no consistent long-term increasing or decreasing trend in the number of accidents for any sector. This indicates that accident occurrences are likely influenced by multiple interacting factors. Importance of Sector-Specific Analysis:

The plot highlights the importance of analyzing accident trends within each sector separately. This allows for a more targeted understanding of the factors contributing to accidents and the development of sector-specific safety interventions.

In [ ]:
# @title Date vs Country count()

from matplotlib import pyplot as plt
import seaborn as sns
def _plot_series(series, series_name, series_index=0):
  palette = list(sns.palettes.mpl_palette('Dark2'))
  counted = (series['Date']
                .value_counts()
              .reset_index(name='counts')
              .rename({'index': 'Date'}, axis=1)
              .sort_values('Date', ascending=True))
  xs = counted['Date']
  ys = counted['counts']
  plt.plot(xs, ys, label=series_name, color=palette[series_index % len(palette)])

fig, ax = plt.subplots(figsize=(15, 5), layout='constrained')
df_sorted = df.sort_values('Date', ascending=True)
for i, (series_name, series) in enumerate(df_sorted.groupby('Country')):
  _plot_series(series, series_name, i)
  fig.legend(title='Country', bbox_to_anchor=(1, 1), loc='upper left')
sns.despine(fig=fig, ax=ax)
plt.xlabel('Date')
_ = plt.ylabel('count()')
<ipython-input-37-25338957823e>:18: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  for i, (series_name, series) in enumerate(df_sorted.groupby('Country')):

Observations: Country_01:

Consistently shows the highest number of accidents throughout the observed period. This indicates a higher overall accident rate in Country_01 compared to the other two countries. Fluctuations and Peaks:

All countries experience fluctuations in the number of accidents over time. Some periods show peaks in accident occurrences, suggesting potential seasonal variations, specific events, or other external factors influencing accident rates. Country_02 and Country_03:

These countries generally show lower accident counts compared to Country_01. However, they also experience fluctuations and occasional peaks in accident occurrences. No Clear Trend:

There is no consistent long-term increasing or decreasing trend in the number of accidents for any country. This suggests that accident occurrences are likely influenced by multiple interacting factors. Country-Specific Factors:

The plot highlights the importance of considering country-specific factors when analyzing accident trends. These factors could include differences in safety regulations, industry practices, cultural attitudes towards safety, and other socio-economic factors.

In [ ]:
# Remove 'Year' and 'Month' columns from the dataframe
df = df.drop(['Year', 'Month'], axis=1)
In [ ]:
# @title Accident Level vs Potential Accident Level

# Create a cross-tabulation of Accident Level and Potential Accident Level
df_2dhist = pd.DataFrame({
    x_label: grp['Potential Accident Level'].value_counts()
    for x_label, grp in df.groupby('Accident Level')
})

# Plot a heatmap
plt.figure(figsize=(9, 8))
sns.heatmap(df_2dhist, annot=True, cmap='Set3')
plt.title('Relationship between Accident Level and Potential Accident Level')
plt.xlabel('Potential Accident Level')
plt.ylabel('Accident Level')
plt.show()

Observations: Diagonal Dominance:

The heatmap shows a strong diagonal dominance, indicating a positive correlation between Accident Level and Potential Accident Level. This implies that accidents with a higher actual severity level are also more likely to have a higher potential severity level. Potential for Worse Outcomes:

There are significant off-diagonal values, especially above the diagonal. This suggests that many accidents that resulted in lower actual severity levels had the potential to be much worse. Preventive Measures:

The difference between actual and potential severity highlights the importance of preventive measures and safety protocols. These measures likely played a role in preventing many accidents from escalating to their full potential severity. Focus Areas for Improvement:

The heatmap can help identify areas where safety measures can be further improved. For example, focusing on accidents with high potential severity but lower actual severity can lead to more effective prevention strategies.

In [ ]:
# @title Industry Sector vs Accident Level

# Group the data by Industry Sector and Accident Level, counting occurrences
grouped_data = df.groupby(['Industry Sector', 'Accident Level'])['Accident Level'].count().unstack().fillna(0)

# Plot a stacked bar chart
grouped_data.plot(kind='bar', stacked=True, figsize=(8, 6),cmap='Set3')
plt.title('Industry Sector vs Accident Level')
plt.xlabel('Industry Sector')
plt.ylabel('Number of Accidents')
plt.xticks(rotation=0, ha='right')
plt.legend(title='Accident Level')
plt.tight_layout()
plt.show()

Observations: Mining Sector:

The Mining sector stands out with the highest number of accidents across all severity levels. This suggests that the mining industry poses a significant risk to worker safety. Other Sectors:

Other sectors like Metals, Others, and Chemicals also show a considerable number of accidents, particularly at lower severity levels. Severity Distribution:

Across all sectors, the majority of accidents fall under Level I and Level II, indicating that most incidents are relatively minor. However, the presence of higher-level accidents (Levels III to VI) emphasizes the need for safety measures even in sectors with predominantly minor incidents. Focus Areas for Improvement:

The chart highlights the need for targeted safety interventions in the Mining sector and other high-risk industries. Efforts should focus on reducing the overall number of accidents and preventing the escalation of minor incidents to more severe levels.

In [ ]:
# @title Distribution of Accident Levels Across Countries

import matplotlib.pyplot as plt

# Assuming 'df' is the DataFrame
city_accident_counts = df.groupby(['Country', 'Accident Level'])['Accident Level'].count().unstack()

city_accident_counts.plot(kind='bar', figsize=(10, 6), cmap='Set3')
plt.xlabel('Country')
plt.ylabel('Number of Accidents')
plt.title('Distribution of Accident Levels Across Countries')
plt.xticks(rotation=90)
_ = plt.tight_layout()
<ipython-input-41-a7d6d9dceec3>:6: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  city_accident_counts = df.groupby(['Country', 'Accident Level'])['Accident Level'].count().unstack()

Observations: Country_01

It consistently shows the highest number of accidents across all accident levels (I to VI). This suggests that Country_01 might have areas for improvement in safety measures compared to the other two countries. Country_02

It generally has the second-highest number of accidents, with a notable increase in level III accidents. This could indicate specific risks or practices within Country_02 that contribute to more severe accidents. Country_03

It has the lowest number of accidents across most levels, particularly in the more severe categories (IV to VI). This might suggest that Country_03 has relatively better safety protocols in place compared to the other countries. Across all countries, the number of accidents decreases as the accident level increases. This is expected, as more severe accidents are generally less frequent.

The distribution of accident levels varies across countries, highlighting potential differences in safety regulations, industry practices, or risk factors specific to each country.

In [ ]:
# @title Distribution of Accident Levels Across Cities

import matplotlib.pyplot as plt

# Assuming 'df' is the DataFrame
city_accident_counts = df.groupby(['City', 'Accident Level'])['Accident Level'].count().unstack()

city_accident_counts.plot(kind='bar', figsize=(15, 6), cmap='Set3')
plt.xlabel('City')
plt.ylabel('Number of Accidents')
plt.title('Distribution of Accident Levels Across Cities')
plt.xticks(rotation=90)
_ = plt.tight_layout()
<ipython-input-42-eeffb4b5c506>:6: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  city_accident_counts = df.groupby(['City', 'Accident Level'])['Accident Level'].count().unstack()

Observations: Accident Distribution:

Accidents are not uniformly distributed across cities. Some cities experience a significantly higher number of accidents compared to others. Severity Variation:

The distribution of accident levels (I to VI) varies across cities. Certain cities might have a higher proportion of severe accidents (levels IV to VI), while others might predominantly experience minor accidents (levels I and II). City-Specific Patterns:

Each city exhibits a unique pattern in terms of accident level distribution. 2. This suggests that factors contributing to accidents might differ from city to city. Potential Focus Areas:

Cities with a higher concentration of accidents, especially those with a higher proportion of severe accidents, could be prioritized for further investigation and safety interventions.

In [ ]:
# @title Country vs Industry Sector

from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
plt.subplots(figsize=(7, 6))
df_2dhist = pd.DataFrame({
    x_label: grp['Industry Sector'].value_counts()
    for x_label, grp in df.groupby('Country')
})
sns.heatmap(df_2dhist, cmap='Set3')
plt.xlabel('Country', fontsize=10)
_ = plt.ylabel('Industry Sector')
<ipython-input-43-d0fbe8a987ae>:9: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  for x_label, grp in df.groupby('Country')

Observations: Country_01:

Highest number of accidents across all industry sectors. Mining is the most accident-prone sector, followed by Metals. Relatively fewer accidents in the Others sector. Country_02:

Shows a more balanced distribution of accidents across sectors compared to Country_01. Mining and Metals still have a significant number of accidents. Country_03:

Has the lowest number of accidents overall. Mining remains a major concern, but other sectors show a relatively lower number of incidents. Overall:

Mining stands out as a high-risk industry across all three countries. Country_01 consistently shows a higher number of accidents compared to the other two countries. The distribution of accidents varies across countries, suggesting potential differences in safety practices or industry compositions.

In [ ]:
# @title  Critical Risk vs Industry Sector
plt.figure(figsize=(12, 18))
sns.countplot(y='Critical Risk', hue='Industry Sector', data=df, palette='Set2')
plt.title('Industry Sector vs Critical Risk')
plt.show()
In [ ]:
# @title Critical Risk vs Employee Type
plt.figure(figsize=(12, 18))
sns.countplot(y='Critical Risk', hue='Employee type', data=df, palette='Set2')
plt.title('Employee Type vs Critical Risk')
plt.show()

Observations: Environmental Risk:

It is the most frequently cited critical risk across all employee types. This suggests that environmental impact is a concern regardless of who is involved in the accident. Health and Safety Risk:

It is the second most common critical risk, particularly for Employees and Third Parties. This highlights the importance of ensuring the safety of both internal and external personnel. Process Safety Risk:

It is more prevalent among Employees, indicating that those directly involved in operational processes are more exposed to this type of risk. Other Risks:

Other critical risks, such as Asset Integrity and Security, are less frequent but still present across different employee types. Employee Type and Risk Correlation:

The distribution of critical risks varies slightly across employee types, suggesting that different roles and responsibilities might influence the types of risks encountered. Focus Areas for Improvement:

The plot emphasizes the need for tailored risk management strategies that consider the specific critical risks associated with different employee types. This could involve providing comprehensive safety training for all employees, implementing strict safety protocols for third-party workers, and enhancing process safety measures to protect those directly involved in operations.

In [ ]:
from datetime import datetime

def add_date_features(df):
    """
    Adds Weekend and Season columns to the dataframe.
    Args:
        df: The dataframe to add features to.
    Returns:
        The dataframe with the added features.
    """
    # Create a copy of the dataframe
    df_preprocess = df.copy()

    # Ensure the 'Date' column is in datetime format
    df_preprocess['Date'] = pd.to_datetime(df_preprocess['Date'])

    # Add Weekend feature
    df_preprocess['Weekend'] = df_preprocess['Date'].dt.dayofweek.isin([5, 6]).astype(int)

    # Add Season feature
    df_preprocess['Season'] = df_preprocess['Date'].dt.month.apply(
        lambda month: 'Summer' if month in [12, 1, 2] else
                      'Autumn' if month in [3, 4, 5] else
                      'Winter' if month in [6, 7, 8] else
                      'Spring'
    )

    # Remove Date column
    df_preprocess = df_preprocess.drop('Date', axis=1)

    return df_preprocess

# Apply the function to the actual dataframe
df_preprocess = add_date_features(df)

print(df_preprocess)
        Country      City Industry Sector Accident Level  \
0    Country_01  Local_01          Mining              1   
1    Country_02  Local_02          Mining              1   
2    Country_01  Local_03          Mining              1   
3    Country_01  Local_04          Mining              1   
4    Country_01  Local_04          Mining              4   
..          ...       ...             ...            ...   
420  Country_01  Local_04          Mining              1   
421  Country_01  Local_03          Mining              1   
422  Country_02  Local_09          Metals              1   
423  Country_02  Local_05          Metals              1   
424  Country_01  Local_04          Mining              1   

    Potential Accident Level  Gender        Employee type  \
0                          4    Male           Contractor   
1                          4    Male             Employee   
2                          3    Male  Contractor (Remote)   
3                          1    Male           Contractor   
4                          4    Male           Contractor   
..                       ...     ...                  ...   
420                        3    Male           Contractor   
421                        2  Female             Employee   
422                        2    Male             Employee   
423                        2    Male             Employee   
424                        2  Female           Contractor   

                    Critical Risk  \
0                         Pressed   
1             Pressurized Systems   
2                    Manual Tools   
3                          Others   
4                          Others   
..                            ...   
420                        Others   
421                        Others   
422              Venomous Animals   
423                           Cut   
424  Fall prevention (same level)   

                                           Description  Day    Weekday  \
0    While removing the drill rod of the Jumbo 08 f...    1     Friday   
1    During the activation of a sodium sulphide pum...    2   Saturday   
2    In the sub-station MILPO located at level +170...    6  Wednesday   
3    Being 9:45 am. approximately in the Nv. 1880 C...    8     Friday   
4    Approximately at 11:45 a.m. in circumstances t...   10     Sunday   
..                                                 ...  ...        ...   
420  Being approximately 5:00 a.m. approximately, w...    4    Tuesday   
421  The collaborator moved from the infrastructure...    4    Tuesday   
422  During the environmental monitoring activity i...    5  Wednesday   
423  The Employee performed the activity of strippi...    6   Thursday   
424  At 10:00 a.m., when the assistant cleaned the ...    9     Sunday   

     WeekofYear  Weekend  Season  
0            53        0  Summer  
1            53        1  Summer  
2             1        0  Summer  
3             1        0  Summer  
4             1        1  Summer  
..          ...      ...     ...  
420          27        0  Winter  
421          27        0  Winter  
422          27        0  Winter  
423          27        0  Winter  
424          27        1  Winter  

[418 rows x 14 columns]
In [ ]:
# @title Season vs Accident Levels, Potential Accident Levels

# Season vs Accident Level
plt.figure(figsize=(10, 6))
sns.countplot(x='Season', hue='Accident Level', data=df_preprocess, palette='Set2')
plt.title('Season vs Accident Level')
plt.show()

# Season vs Potential Accident Level
plt.figure(figsize=(10, 6))
sns.countplot(x='Season', hue='Potential Accident Level', data=df_preprocess, palette='Set2')
plt.title('Season vs Potential Accident Level')
plt.show()

Observations: Season vs Accident Level:

Accidents seem to be fairly evenly distributed across seasons, with a slight increase in Autumn. 2 This suggests that seasonal factors might not play a major role in the overall occurrence of accidents. However, it's worth investigating if specific types of accidents are more prevalent in certain seasons. Season vs Potential Accident Level:

Similar to the previous plot, the distribution of potential accident levels appears relatively consistent across seasons. This indicates that the potential severity of accidents is not strongly influenced by seasonal factors.

In [ ]:
# @title Potential Accident Level vs Weekend

from matplotlib import pyplot as plt
import seaborn as sns
figsize = (12, 1.2 * len(df_preprocess['Potential Accident Level'].unique()))
plt.figure(figsize=figsize)
sns.violinplot(df_preprocess, x='Weekend', y='Potential Accident Level', inner='stick', palette='Set2')
sns.despine(top=True, right=True, bottom=True, left=True)
<ipython-input-48-5ad1a0ee56a0>:7: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.violinplot(df_preprocess, x='Weekend', y='Potential Accident Level', inner='stick', palette='Set2')

Observations: Weekends vs Weekdays:

The distribution of potential accident levels appears relatively similar between weekends and weekdays. There isn't a strong indication that weekends have a significantly higher or lower likelihood of accidents with a certain potential severity level compared to weekdays. Potential Accident Level I:

It is the most frequent potential accident level for both weekends and weekdays, suggesting that most incidents, regardless of the day of the week, have a low potential for severe consequences. Higher Potential Accident Levels:

Potential accident levels III to VI are less frequent but present on both weekends and weekdays. This indicates that the possibility of more severe accidents exists throughout the week, although the likelihood is generally lower. Further Analysis:

While the violin plot provides a general overview, further statistical analysis might be needed to confirm whether there are any statistically significant differences in the distribution of potential accident levels between weekends and weekdays.

Step 3.2 NLP Analysis

In [ ]:
df_preprocess.to_csv('/content/drive/MyDrive/AIML_Capstone_Project/df_preprocess.csv', index=False)
In [ ]:
from collections import Counter
import re
import nltk
from nltk.corpus import stopwords

# Ensure stopwords are downloaded
nltk.download('stopwords')

# Function to clean and tokenize descriptions
def tokenize(text):
    # Use a regular expression to find words that are purely alphabetic
    tokens = re.findall(r'\b[a-zA-Z]+\b', text.lower())
    # Filter out stopwords
    stop_words = set(stopwords.words('english'))
    return [word for word in tokens if word not in stop_words]

# Assuming ISH_df_preprocess['Description'] contains the descriptions
# Tokenize each description and create a flat list of all words
all_words = [word for description in df_preprocess['Description'] for word in tokenize(description)]

# Count the frequency of each word
word_counts = Counter(all_words)

# Display the most common words to get insights for categorizing accidents
word_counts.most_common(50)
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Out[ ]:
[('causing', 166),
 ('hand', 163),
 ('employee', 156),
 ('left', 155),
 ('right', 154),
 ('operator', 126),
 ('injury', 104),
 ('time', 101),
 ('activity', 91),
 ('area', 80),
 ('moment', 78),
 ('equipment', 77),
 ('work', 76),
 ('accident', 73),
 ('collaborator', 71),
 ('level', 70),
 ('worker', 70),
 ('assistant', 68),
 ('finger', 68),
 ('pipe', 67),
 ('one', 65),
 ('floor', 65),
 ('support', 58),
 ('mesh', 58),
 ('rock', 54),
 ('safety', 53),
 ('mr', 53),
 ('approximately', 50),
 ('meters', 47),
 ('height', 46),
 ('described', 45),
 ('part', 44),
 ('team', 44),
 ('side', 43),
 ('injured', 42),
 ('truck', 42),
 ('face', 42),
 ('used', 42),
 ('kg', 40),
 ('circumstances', 39),
 ('cut', 39),
 ('gloves', 39),
 ('pump', 38),
 ('hit', 38),
 ('metal', 38),
 ('performing', 37),
 ('medical', 37),
 ('towards', 37),
 ('using', 35),
 ('made', 34)]
In [ ]:
# Function to tokenize descriptions, filtering out numbers and special characters
def tokenize(text):
    # Regular expression to find words that are purely alphabetic
    tokens = re.findall(r'\b[a-zA-Z]+\b', text.lower())
    # Filter out stopwords
    stop_words = set(stopwords.words('english'))
    return [word for word in tokens if word not in stop_words]

# Function to find phrases that might indicate new categories
def find_phrases(text, length=2):
    tokens = tokenize(text)
    return [' '.join(tokens[i:i+length]) for i in range(len(tokens) - length + 1)]

# Assuming ISH_df_preprocess['Description'] contains the descriptions
# Generate bi-grams (two-word phrases) from descriptions
bi_grams = [phrase for description in df_preprocess['Description'] for phrase in find_phrases(description, 2)]

# Count the frequency of each bi-gram
bi_gram_counts = Counter(bi_grams)

# Display the most common bi-grams to get insights for new categorizing accidents
bi_gram_counts.most_common(50)
Out[ ]:
[('left hand', 70),
 ('right hand', 57),
 ('time accident', 56),
 ('causing injury', 51),
 ('finger left', 22),
 ('employee reports', 22),
 ('injury described', 18),
 ('medical center', 17),
 ('described injury', 17),
 ('left foot', 15),
 ('injured person', 15),
 ('hand causing', 14),
 ('support mesh', 14),
 ('injury time', 14),
 ('right side', 13),
 ('finger right', 13),
 ('da silva', 13),
 ('allergic reaction', 13),
 ('right leg', 11),
 ('safety gloves', 11),
 ('made use', 10),
 ('fragment rock', 10),
 ('wearing safety', 10),
 ('time event', 10),
 ('right foot', 9),
 ('split set', 9),
 ('upper part', 9),
 ('left leg', 9),
 ('middle finger', 9),
 ('height meters', 9),
 ('ring finger', 9),
 ('left side', 9),
 ('accident employee', 9),
 ('weight kg', 8),
 ('generating injury', 8),
 ('causing cut', 8),
 ('generating described', 8),
 ('metal structure', 8),
 ('work area', 8),
 ('kg weight', 7),
 ('transferred medical', 7),
 ('master loader', 7),
 ('worker wearing', 7),
 ('index finger', 7),
 ('piece rock', 7),
 ('employee performing', 7),
 ('x cm', 7),
 ('lesion described', 7),
 ('used safety', 7),
 ('described time', 7)]
In [ ]:
# Function to tokenize descriptions, filtering out numbers and special characters
def tokenize(text):
    # Regular expression to find words that are purely alphabetic
    tokens = re.findall(r'\b[a-zA-Z]+\b', text.lower())
    # Filter out stopwords
    stop_words = set(stopwords.words('english'))
    return [word for word in tokens if word not in stop_words]

# Function to find phrases that might indicate new categories
def find_phrases(text, length=3):  # Adjust length default to 3 for trigrams
    tokens = tokenize(text)
    return [' '.join(tokens[i:i+length]) for i in range(len(tokens) - length + 1)]

# Assuming ISH_df_preprocess['Description'] contains the descriptions
# Generate trigrams (three-word phrases) from descriptions
tri_grams = [phrase for description in df_preprocess['Description'] for phrase in find_phrases(description)]

# Count the frequency of each trigram
tri_gram_counts = Counter(tri_grams)

# Display the most common trigrams to get insights for new categorizing accidents
tri_gram_counts.most_common(50)
Out[ ]:
[('finger left hand', 21),
 ('causing injury described', 13),
 ('finger right hand', 13),
 ('injury time accident', 13),
 ('generating described injury', 8),
 ('time accident employee', 8),
 ('hand causing injury', 7),
 ('described time accident', 7),
 ('left hand causing', 6),
 ('right hand causing', 6),
 ('back right hand', 5),
 ('worker wearing safety', 5),
 ('causing described injury', 5),
 ('cm x cm', 5),
 ('causing injury time', 5),
 ('returned normal activities', 5),
 ('manoel da silva', 5),
 ('approximately nv cx', 4),
 ('time accident worker', 4),
 ('accident worker wearing', 4),
 ('wearing safety gloves', 4),
 ('medical center attention', 4),
 ('made use safety', 4),
 ('used safety glasses', 4),
 ('generating injury time', 4),
 ('described injury time', 4),
 ('thermal recovery boiler', 4),
 ('verified type allergic', 4),
 ('type allergic reaction', 4),
 ('allergic reaction returned', 4),
 ('reaction returned normal', 4),
 ('generating lesion described', 4),
 ('place clerk wearing', 4),
 ('hand generating described', 4),
 ('employee reports performed', 4),
 ('hitting palm left', 3),
 ('palm left hand', 3),
 ('time fragment rock', 3),
 ('floor causing injury', 3),
 ('worker time accident', 3),
 ('transferred medical center', 3),
 ('little finger left', 3),
 ('index finger right', 3),
 ('type safety gloves', 3),
 ('circumstances two workers', 3),
 ('crown piece rock', 3),
 ('time event collaborator', 3),
 ('causing blunt cut', 3),
 ('use safety belt', 3),
 ('heavy equipment operator', 3)]
Wordclouds for Unigrams, Bigrams and Trigrams for Pre NLP data¶
In [ ]:
from wordcloud import WordCloud

# Create wordcloud for unigrams
wordcloud_unigrams = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_counts)

# Create wordcloud for bigrams
wordcloud_bigrams = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(bi_gram_counts)

# Create wordcloud for trigrams
wordcloud_trigrams = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(tri_gram_counts)

# Display the generated wordclouds
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud_unigrams, interpolation='bilinear')
plt.axis("off")
plt.title("Unigram Wordcloud")
plt.show()

plt.subplots_adjust(hspace=1)  # Adjust vertical spacing

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud_bigrams, interpolation='bilinear')
plt.axis("off")
plt.title("Bigram Wordcloud")
plt.show()

plt.subplots_adjust(hspace=1)  # Adjust vertical spacing

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud_trigrams, interpolation='bilinear')
plt.axis("off")
plt.title("Trigram Wordcloud")
plt.show()
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>

Step 3.2.1 NLP Pre-processing

Data preprocessing (NLP Preprocessing techniques)¶

Few of the NLP pre-processing steps taken before applying model on the data

Converting to lower case, avoid any capital cases Converting apostrophe to the standard lexicons Removing punctuations Lemmatization Removing stop words

In [ ]:
import nltk
nltk.download('punkt', force=True)
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Out[ ]:
True
In [ ]:
import os
nltk_data_dir = os.path.expanduser('~/nltk_data')
if os.path.exists(nltk_data_dir):
    import shutil
    shutil.rmtree(nltk_data_dir)  # Remove the corrupted nltk_data folder
In [ ]:
# Redownload necessary resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
Out[ ]:
True
In [ ]:
import os
import nltk

# Remove the NLTK data folder
nltk_data_dir = os.path.expanduser('~/nltk_data')
if os.path.exists(nltk_data_dir):
    import shutil
    shutil.rmtree(nltk_data_dir)
In [ ]:
!pip install --upgrade nltk
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.9.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk) (8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk) (1.4.2)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk) (2024.9.11)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk) (4.66.6)
In [ ]:
!pip install spacy
!python -m spacy download en_core_web_sm
Requirement already satisfied: spacy in /usr/local/lib/python3.10/dist-packages (3.7.5)
Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.11 in /usr/local/lib/python3.10/dist-packages (from spacy) (3.0.12)
Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from spacy) (1.0.5)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.10/dist-packages (from spacy) (1.0.11)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.10/dist-packages (from spacy) (2.0.10)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.10/dist-packages (from spacy) (3.0.9)
Requirement already satisfied: thinc<8.3.0,>=8.2.2 in /usr/local/lib/python3.10/dist-packages (from spacy) (8.2.5)
Requirement already satisfied: wasabi<1.2.0,>=0.9.1 in /usr/local/lib/python3.10/dist-packages (from spacy) (1.1.3)
Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /usr/local/lib/python3.10/dist-packages (from spacy) (2.5.0)
Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /usr/local/lib/python3.10/dist-packages (from spacy) (2.0.10)
Requirement already satisfied: weasel<0.5.0,>=0.1.0 in /usr/local/lib/python3.10/dist-packages (from spacy) (0.4.1)
Requirement already satisfied: typer<1.0.0,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from spacy) (0.15.1)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.10/dist-packages (from spacy) (4.66.6)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from spacy) (2.32.3)
Requirement already satisfied: pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4 in /usr/local/lib/python3.10/dist-packages (from spacy) (2.10.3)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from spacy) (3.1.4)
Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from spacy) (75.1.0)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from spacy) (24.2)
Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /usr/local/lib/python3.10/dist-packages (from spacy) (3.5.0)
Requirement already satisfied: numpy>=1.19.0 in /usr/local/lib/python3.10/dist-packages (from spacy) (1.26.4)
Requirement already satisfied: language-data>=1.2 in /usr/local/lib/python3.10/dist-packages (from langcodes<4.0.0,>=3.2.0->spacy) (1.3.0)
Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.10/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy) (0.7.0)
Requirement already satisfied: pydantic-core==2.27.1 in /usr/local/lib/python3.10/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy) (2.27.1)
Requirement already satisfied: typing-extensions>=4.12.2 in /usr/local/lib/python3.10/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy) (4.12.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (2024.8.30)
Requirement already satisfied: blis<0.8.0,>=0.7.8 in /usr/local/lib/python3.10/dist-packages (from thinc<8.3.0,>=8.2.2->spacy) (0.7.11)
Requirement already satisfied: confection<1.0.0,>=0.0.1 in /usr/local/lib/python3.10/dist-packages (from thinc<8.3.0,>=8.2.2->spacy) (0.1.5)
Requirement already satisfied: click>=8.0.0 in /usr/local/lib/python3.10/dist-packages (from typer<1.0.0,>=0.3.0->spacy) (8.1.7)
Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.10/dist-packages (from typer<1.0.0,>=0.3.0->spacy) (1.5.4)
Requirement already satisfied: rich>=10.11.0 in /usr/local/lib/python3.10/dist-packages (from typer<1.0.0,>=0.3.0->spacy) (13.9.4)
Requirement already satisfied: cloudpathlib<1.0.0,>=0.7.0 in /usr/local/lib/python3.10/dist-packages (from weasel<0.5.0,>=0.1.0->spacy) (0.20.0)
Requirement already satisfied: smart-open<8.0.0,>=5.2.1 in /usr/local/lib/python3.10/dist-packages (from weasel<0.5.0,>=0.1.0->spacy) (7.0.5)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->spacy) (3.0.2)
Requirement already satisfied: marisa-trie>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from language-data>=1.2->langcodes<4.0.0,>=3.2.0->spacy) (1.2.1)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich>=10.11.0->typer<1.0.0,>=0.3.0->spacy) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich>=10.11.0->typer<1.0.0,>=0.3.0->spacy) (2.18.0)
Requirement already satisfied: wrapt in /usr/local/lib/python3.10/dist-packages (from smart-open<8.0.0,>=5.2.1->weasel<0.5.0,>=0.1.0->spacy) (1.17.0)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich>=10.11.0->typer<1.0.0,>=0.3.0->spacy) (0.1.2)
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 110.1 MB/s eta 0:00:00
Requirement already satisfied: spacy<3.8.0,>=3.7.2 in /usr/local/lib/python3.10/dist-packages (from en-core-web-sm==3.7.1) (3.7.5)
Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.11 in /usr/local/lib/python3.10/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (3.0.12)
Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (1.0.5)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.10/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (1.0.11)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.10/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (2.0.10)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.10/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (3.0.9)
Requirement already satisfied: thinc<8.3.0,>=8.2.2 in /usr/local/lib/python3.10/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (8.2.5)
Requirement already satisfied: wasabi<1.2.0,>=0.9.1 in /usr/local/lib/python3.10/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (1.1.3)
Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /usr/local/lib/python3.10/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (2.5.0)
Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /usr/local/lib/python3.10/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (2.0.10)
Requirement already satisfied: weasel<0.5.0,>=0.1.0 in /usr/local/lib/python3.10/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (0.4.1)
Requirement already satisfied: typer<1.0.0,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (0.15.1)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.10/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (4.66.6)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (2.32.3)
Requirement already satisfied: pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4 in /usr/local/lib/python3.10/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (2.10.3)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (3.1.4)
Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (75.1.0)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (24.2)
Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /usr/local/lib/python3.10/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (3.5.0)
Requirement already satisfied: numpy>=1.19.0 in /usr/local/lib/python3.10/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (1.26.4)
Requirement already satisfied: language-data>=1.2 in /usr/local/lib/python3.10/dist-packages (from langcodes<4.0.0,>=3.2.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (1.3.0)
Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.10/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (0.7.0)
Requirement already satisfied: pydantic-core==2.27.1 in /usr/local/lib/python3.10/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (2.27.1)
Requirement already satisfied: typing-extensions>=4.12.2 in /usr/local/lib/python3.10/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (4.12.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (2024.8.30)
Requirement already satisfied: blis<0.8.0,>=0.7.8 in /usr/local/lib/python3.10/dist-packages (from thinc<8.3.0,>=8.2.2->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (0.7.11)
Requirement already satisfied: confection<1.0.0,>=0.0.1 in /usr/local/lib/python3.10/dist-packages (from thinc<8.3.0,>=8.2.2->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (0.1.5)
Requirement already satisfied: click>=8.0.0 in /usr/local/lib/python3.10/dist-packages (from typer<1.0.0,>=0.3.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (8.1.7)
Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.10/dist-packages (from typer<1.0.0,>=0.3.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (1.5.4)
Requirement already satisfied: rich>=10.11.0 in /usr/local/lib/python3.10/dist-packages (from typer<1.0.0,>=0.3.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (13.9.4)
Requirement already satisfied: cloudpathlib<1.0.0,>=0.7.0 in /usr/local/lib/python3.10/dist-packages (from weasel<0.5.0,>=0.1.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (0.20.0)
Requirement already satisfied: smart-open<8.0.0,>=5.2.1 in /usr/local/lib/python3.10/dist-packages (from weasel<0.5.0,>=0.1.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (7.0.5)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (3.0.2)
Requirement already satisfied: marisa-trie>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from language-data>=1.2->langcodes<4.0.0,>=3.2.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (1.2.1)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich>=10.11.0->typer<1.0.0,>=0.3.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich>=10.11.0->typer<1.0.0,>=0.3.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (2.18.0)
Requirement already satisfied: wrapt in /usr/local/lib/python3.10/dist-packages (from smart-open<8.0.0,>=5.2.1->weasel<0.5.0,>=0.1.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (1.17.0)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich>=10.11.0->typer<1.0.0,>=0.3.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (0.1.2)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
In [ ]:
import spacy
nlp = spacy.load('en_core_web_sm')

def preprocess_text_spacy(text):
    # Tokenize and preprocess using Spacy
    doc = nlp(text.lower())
    tokens = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]
    return ' '.join(tokens)

# Apply preprocessing
df_preprocess['Cleaned_Description'] = df_preprocess['Description'].apply(preprocess_text_spacy)
In [ ]:
# Save the number of words before and after cleaning
df_preprocess['Original_Word_Count'] = df_preprocess['Description'].apply(lambda x: len(str(x).split()))
df_preprocess['Cleaned_Word_Count'] = df_preprocess['Cleaned_Description'].apply(lambda x: len(str(x).split()))

# Display the first few rows of the original and cleaned descriptions
print(df_preprocess[['Description', 'Cleaned_Description']].head())
                                         Description  \
0  While removing the drill rod of the Jumbo 08 f...   
1  During the activation of a sodium sulphide pum...   
2  In the sub-station MILPO located at level +170...   
3  Being 9:45 am. approximately in the Nv. 1880 C...   
4  Approximately at 11:45 a.m. in circumstances t...   

                                 Cleaned_Description  
0  remove drill rod jumbo maintenance supervisor ...  
1  activation sodium sulphide pump piping uncoupl...  
2  sub station milpo locate level collaborator ex...  
3  approximately nv personnel begin task unlock s...  
4  approximately circumstance mechanic anthony gr...  
In [ ]:
df_preprocess[['Description', 'Cleaned_Description']].head()
Out[ ]:
Description Cleaned_Description
0 While removing the drill rod of the Jumbo 08 f... remove drill rod jumbo maintenance supervisor ...
1 During the activation of a sodium sulphide pum... activation sodium sulphide pump piping uncoupl...
2 In the sub-station MILPO located at level +170... sub station milpo locate level collaborator ex...
3 Being 9:45 am. approximately in the Nv. 1880 C... approximately nv personnel begin task unlock s...
4 Approximately at 11:45 a.m. in circumstances t... approximately circumstance mechanic anthony gr...
In [ ]:
df_preprocess
Out[ ]:
Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Description Day Weekday WeekofYear Weekend Season Cleaned_Description Original_Word_Count Cleaned_Word_Count
0 Country_01 Local_01 Mining 1 4 Male Contractor Pressed While removing the drill rod of the Jumbo 08 f... 1 Friday 53 0 Summer remove drill rod jumbo maintenance supervisor ... 80 36
1 Country_02 Local_02 Mining 1 4 Male Employee Pressurized Systems During the activation of a sodium sulphide pum... 2 Saturday 53 1 Summer activation sodium sulphide pump piping uncoupl... 54 26
2 Country_01 Local_03 Mining 1 3 Male Contractor (Remote) Manual Tools In the sub-station MILPO located at level +170... 6 Wednesday 1 0 Summer sub station milpo locate level collaborator ex... 57 28
3 Country_01 Local_04 Mining 1 1 Male Contractor Others Being 9:45 am. approximately in the Nv. 1880 C... 8 Friday 1 0 Summer approximately nv personnel begin task unlock s... 97 47
4 Country_01 Local_04 Mining 4 4 Male Contractor Others Approximately at 11:45 a.m. in circumstances t... 10 Sunday 1 1 Summer approximately circumstance mechanic anthony gr... 88 42
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
420 Country_01 Local_04 Mining 1 3 Male Contractor Others Being approximately 5:00 a.m. approximately, w... 4 Tuesday 27 0 Winter approximately approximately lift kelly hq pull... 38 16
421 Country_01 Local_03 Mining 1 2 Female Employee Others The collaborator moved from the infrastructure... 4 Tuesday 27 0 Winter collaborator move infrastructure office julio ... 39 20
422 Country_02 Local_09 Metals 1 2 Male Employee Venomous Animals During the environmental monitoring activity i... 5 Wednesday 27 0 Winter environmental monitoring activity area employe... 44 19
423 Country_02 Local_05 Metals 1 2 Male Employee Cut The Employee performed the activity of strippi... 6 Thursday 27 0 Winter employee perform activity strip cathode pull c... 33 17
424 Country_01 Local_04 Mining 1 2 Female Contractor Fall prevention (same level) At 10:00 a.m., when the assistant cleaned the ... 9 Sunday 27 1 Winter assistant clean floor module e central camp sl... 35 18

418 rows × 17 columns

In [ ]:
# Calculate and print the average word count before and after cleaning
avg_original = df_preprocess['Original_Word_Count'].mean()
avg_cleaned = df_preprocess['Cleaned_Word_Count'].mean()
print(f"\nAverage word count before cleaning: {avg_original:.2f}")
print(f"Average word count after cleaning: {avg_cleaned:.2f}")
print(f"Reduction in words: {(avg_original - avg_cleaned) / avg_original * 100:.2f}%")
Average word count before cleaning: 65.06
Average word count after cleaning: 30.89
Reduction in words: 52.52%
In [ ]:
# Removing the repetitive and unnecessary columns which is not required for analysis

Unnecessary_Columns = ['Description','Original_Word_Count','Cleaned_Word_Count']

# Drop unnecessary columns
df_preprocess = df_preprocess.drop(Unnecessary_Columns, axis=1)

df_preprocess
Out[ ]:
Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Day Weekday WeekofYear Weekend Season Cleaned_Description
0 Country_01 Local_01 Mining 1 4 Male Contractor Pressed 1 Friday 53 0 Summer remove drill rod jumbo maintenance supervisor ...
1 Country_02 Local_02 Mining 1 4 Male Employee Pressurized Systems 2 Saturday 53 1 Summer activation sodium sulphide pump piping uncoupl...
2 Country_01 Local_03 Mining 1 3 Male Contractor (Remote) Manual Tools 6 Wednesday 1 0 Summer sub station milpo locate level collaborator ex...
3 Country_01 Local_04 Mining 1 1 Male Contractor Others 8 Friday 1 0 Summer approximately nv personnel begin task unlock s...
4 Country_01 Local_04 Mining 4 4 Male Contractor Others 10 Sunday 1 1 Summer approximately circumstance mechanic anthony gr...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
420 Country_01 Local_04 Mining 1 3 Male Contractor Others 4 Tuesday 27 0 Winter approximately approximately lift kelly hq pull...
421 Country_01 Local_03 Mining 1 2 Female Employee Others 4 Tuesday 27 0 Winter collaborator move infrastructure office julio ...
422 Country_02 Local_09 Metals 1 2 Male Employee Venomous Animals 5 Wednesday 27 0 Winter environmental monitoring activity area employe...
423 Country_02 Local_05 Metals 1 2 Male Employee Cut 6 Thursday 27 0 Winter employee perform activity strip cathode pull c...
424 Country_01 Local_04 Mining 1 2 Female Contractor Fall prevention (same level) 9 Sunday 27 1 Winter assistant clean floor module e central camp sl...

418 rows × 14 columns

In [ ]:
df_preprocess.columns
Out[ ]:
Index(['Country', 'City', 'Industry Sector', 'Accident Level',
       'Potential Accident Level', 'Gender', 'Employee type', 'Critical Risk',
       'Day', 'Weekday', 'WeekofYear', 'Weekend', 'Season',
       'Cleaned_Description'],
      dtype='object')
In [ ]:
# Rename Cleaned Desription to Description
df_preprocess = df_preprocess.rename(columns={'Cleaned_Description': 'Description'})
df_preprocess
Out[ ]:
Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Day Weekday WeekofYear Weekend Season Description
0 Country_01 Local_01 Mining 1 4 Male Contractor Pressed 1 Friday 53 0 Summer remove drill rod jumbo maintenance supervisor ...
1 Country_02 Local_02 Mining 1 4 Male Employee Pressurized Systems 2 Saturday 53 1 Summer activation sodium sulphide pump piping uncoupl...
2 Country_01 Local_03 Mining 1 3 Male Contractor (Remote) Manual Tools 6 Wednesday 1 0 Summer sub station milpo locate level collaborator ex...
3 Country_01 Local_04 Mining 1 1 Male Contractor Others 8 Friday 1 0 Summer approximately nv personnel begin task unlock s...
4 Country_01 Local_04 Mining 4 4 Male Contractor Others 10 Sunday 1 1 Summer approximately circumstance mechanic anthony gr...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
420 Country_01 Local_04 Mining 1 3 Male Contractor Others 4 Tuesday 27 0 Winter approximately approximately lift kelly hq pull...
421 Country_01 Local_03 Mining 1 2 Female Employee Others 4 Tuesday 27 0 Winter collaborator move infrastructure office julio ...
422 Country_02 Local_09 Metals 1 2 Male Employee Venomous Animals 5 Wednesday 27 0 Winter environmental monitoring activity area employe...
423 Country_02 Local_05 Metals 1 2 Male Employee Cut 6 Thursday 27 0 Winter employee perform activity strip cathode pull c...
424 Country_01 Local_04 Mining 1 2 Female Contractor Fall prevention (same level) 9 Sunday 27 1 Winter assistant clean floor module e central camp sl...

418 rows × 14 columns

In [ ]:
# Save the preprocessed data
df_preprocess.to_csv('/content/drive/MyDrive/AIML_Capstone_Project/df_preprocess.csv', index=False)
In [ ]:
from collections import Counter

# Load the preprocessed data
df_preprocess = pd.read_csv('/content/drive/MyDrive/AIML_Capstone_Project/df_preprocess.csv')

import spacy
from collections import Counter

# Load Spacy model
nlp = spacy.load('en_core_web_sm')

# Combine all descriptions into a single string
all_text = ' '.join(df_preprocess['Description'].astype(str))

# Tokenize text using Spacy
doc = nlp(all_text)
tokens = [token.text for token in doc if token.is_alpha]

# Calculate token distribution
token_counts = Counter(tokens)

# Create a DataFrame from the most common words
top_words_df = pd.DataFrame(token_counts.most_common(30), columns=['Word', 'Count'])

# Display the DataFrame
print(top_words_df)
/usr/local/lib/python3.10/dist-packages/spacy/util.py:1740: UserWarning: [W111] Jupyter notebook detected: if using `prefer_gpu()` or `require_gpu()`, include it in the same cell right before `spacy.load()` to ensure that the model is loaded on the correct device. More information: http://spacy.io/usage/v3#jupyter-notebook-gpu
  warnings.warn(Warnings.W111)
            Word  Count
0          cause    190
1           hand    177
2       employee    172
3          right    154
4           left    138
5       operator    132
6       activity    117
7           time    112
8         injury    110
9         moment    101
10           hit     97
11          fall     87
12        worker     87
13          work     86
14  collaborator     81
15       perform     81
16          area     80
17     equipment     76
18        finger     76
19     assistant     75
20      accident     73
21          pipe     71
22       support     70
23         level     70
24         floor     65
25            cm     64
26        remove     60
27          mesh     59
28         place     57
29           cut     57

Step 3.2.2 NLP Visualization

In [ ]:
import spacy
from nltk.util import ngrams
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Load Spacy model
nlp = spacy.load('en_core_web_sm')

# Combine all descriptions into a single string
all_text = ' '.join(df_preprocess['Description'].astype(str))

# Tokenize text using Spacy
doc = nlp(all_text)
tokens = [token.text for token in doc if token.is_alpha]

# Generate word cloud for unigrams
unigram_text = ' '.join(tokens)
wordcloud_unigrams = WordCloud(width=800, height=400, background_color='white').generate(unigram_text)

# Generate word cloud for bigrams
bigrams = ['_'.join(bigram) for bigram in ngrams(tokens, 2)]
bigram_text = ' '.join(bigrams)
wordcloud_bigrams = WordCloud(width=800, height=400, background_color='white').generate(bigram_text)

# Generate word cloud for trigrams
trigrams = ['_'.join(trigram) for trigram in ngrams(tokens, 3)]
trigram_text = ' '.join(trigrams)
wordcloud_trigrams = WordCloud(width=800, height=400, background_color='white').generate(trigram_text)

# Display the word clouds
plt.figure(figsize=(20, 10))
plt.subplot(1, 3, 1)
plt.imshow(wordcloud_unigrams, interpolation='bilinear')
plt.title('Unigrams')
plt.axis('off')

plt.subplot(1, 3, 2)
plt.imshow(wordcloud_bigrams, interpolation='bilinear')
plt.title('Bigrams')
plt.axis('off')

plt.subplot(1, 3, 3)
plt.imshow(wordcloud_trigrams, interpolation='bilinear')
plt.title('Trigrams')
plt.axis('off')

plt.tight_layout()
plt.show()
/usr/local/lib/python3.10/dist-packages/spacy/util.py:1740: UserWarning: [W111] Jupyter notebook detected: if using `prefer_gpu()` or `require_gpu()`, include it in the same cell right before `spacy.load()` to ensure that the model is loaded on the correct device. More information: http://spacy.io/usage/v3#jupyter-notebook-gpu
  warnings.warn(Warnings.W111)

Observations: Unigrams:

Key words include "moment," "employee," "floor," "equipment," "assistant," "left," and "hand." These suggest an incident involving an employee and equipment on a specific floor. Words like "collaboration," "injury," and "support" indicate teamwork and possibly injury response. "Left" near "hand" points to a body part, likely in a workplace injury report. This might relate to a safety analysis or accident report in an industrial setting. Bigrams:

Frequent bigrams like "left hand" and "right hand" indicate a focus on hand and finger injuries. This suggests frequent hand-related injuries in the analyzed data or reports. Other terms like "left leg" and "left foot" appear but are less common. Phrases like "causing injury" and "employee performing" point to work-related injuries. Terms such as "causing cut" and "causing fall" highlight common injury mechanisms. Trigrams:

Trigrams like "left hand causing" and "finger left hand" focus on injuries to the left hand or fingers. Phrases like "used safety glass" suggest the involvement of specific safety measures. The emphasis on hands and fingers shows their vulnerability in the workplace. The analysis details injury causes and is useful for prevention. Words like "operator" and "employee" next to "accident" and "injury" emphasize roles in safety protocols. Overall:

N-grams analysis offers insights into key themes and patterns in incident reports. It identifies accident contributors and areas for safety improvement. The findings could help develop interventions to enhance workplace safety.

In [ ]:
# Function to preprocess and tokenize descriptions
def preprocess_and_tokenize(description):
    # Convert to lowercase
    description = description.lower()
    # Remove punctuation and non-alphabetic characters
    description = re.sub(r'[^a-z\s]', '', description)
    # Tokenize (split by whitespace)
    words = description.split()
    return words

# Apply the preprocessing function
df_preprocess['tokenized_words'] = df_preprocess['Description'].apply(preprocess_and_tokenize)
In [ ]:
df_preprocess
Out[ ]:
Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Day Weekday WeekofYear Weekend Season Description tokenized_words
0 Country_01 Local_01 Mining 1 4 Male Contractor Pressed 1 Friday 53 0 Summer remove drill rod jumbo maintenance supervisor ... [remove, drill, rod, jumbo, maintenance, super...
1 Country_02 Local_02 Mining 1 4 Male Employee Pressurized Systems 2 Saturday 53 1 Summer activation sodium sulphide pump piping uncoupl... [activation, sodium, sulphide, pump, piping, u...
2 Country_01 Local_03 Mining 1 3 Male Contractor (Remote) Manual Tools 6 Wednesday 1 0 Summer sub station milpo locate level collaborator ex... [sub, station, milpo, locate, level, collabora...
3 Country_01 Local_04 Mining 1 1 Male Contractor Others 8 Friday 1 0 Summer approximately nv personnel begin task unlock s... [approximately, nv, personnel, begin, task, un...
4 Country_01 Local_04 Mining 4 4 Male Contractor Others 10 Sunday 1 1 Summer approximately circumstance mechanic anthony gr... [approximately, circumstance, mechanic, anthon...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
413 Country_01 Local_04 Mining 1 3 Male Contractor Others 4 Tuesday 27 0 Winter approximately approximately lift kelly hq pull... [approximately, approximately, lift, kelly, hq...
414 Country_01 Local_03 Mining 1 2 Female Employee Others 4 Tuesday 27 0 Winter collaborator move infrastructure office julio ... [collaborator, move, infrastructure, office, j...
415 Country_02 Local_09 Metals 1 2 Male Employee Venomous Animals 5 Wednesday 27 0 Winter environmental monitoring activity area employe... [environmental, monitoring, activity, area, em...
416 Country_02 Local_05 Metals 1 2 Male Employee Cut 6 Thursday 27 0 Winter employee perform activity strip cathode pull c... [employee, perform, activity, strip, cathode, ...
417 Country_01 Local_04 Mining 1 2 Female Contractor Fall prevention (same level) 9 Sunday 27 1 Winter assistant clean floor module e central camp sl... [assistant, clean, floor, module, e, central, ...

418 rows × 15 columns

In [ ]:
df_preprocess.shape
Out[ ]:
(418, 15)
In [ ]:
df_preprocess.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Country                   418 non-null    object
 1   City                      418 non-null    object
 2   Industry Sector           418 non-null    object
 3   Accident Level            418 non-null    int64 
 4   Potential Accident Level  418 non-null    int64 
 5   Gender                    418 non-null    object
 6   Employee type             418 non-null    object
 7   Critical Risk             418 non-null    object
 8   Day                       418 non-null    int64 
 9   Weekday                   418 non-null    object
 10  WeekofYear                418 non-null    int64 
 11  Weekend                   418 non-null    int64 
 12  Season                    418 non-null    object
 13  Description               418 non-null    object
 14  tokenized_words           418 non-null    object
dtypes: int64(5), object(10)
memory usage: 49.1+ KB
In [ ]:
df_preprocess1 = df_preprocess.copy()
In [ ]:
df_preprocess2 = df_preprocess.copy()
In [ ]:
df_preprocess.columns
Out[ ]:
Index(['Country', 'City', 'Industry Sector', 'Accident Level',
       'Potential Accident Level', 'Gender', 'Employee type', 'Critical Risk',
       'Day', 'Weekday', 'WeekofYear', 'Weekend', 'Season', 'Description',
       'tokenized_words'],
      dtype='object')

Step 3.2.2 NLP Visualization

NLP Pre-processing Summary

Few of the NLP pre-processing steps taken before applying model on the data

  • Converting to lower case, avoid any capital cases
  • Converting to lower case, avoid any capital cases
  • Converting to lower case, avoid any capital cases
  • Removing punctuations
  • Lemmatization
  • removing stop words
  • After pre-processing steps
  • filtering out numbers and special character

After pre-processing steps:

  • Average word count before cleaning: 65.06
  • Average word count after cleaning: 32.80
  • Reduction in words: 49.58%

Step 4 - Data preparation - Cleansed data in .xlsx

In [ ]:
# Save the preprocessed data
df_preprocess.to_csv('/content/drive/MyDrive/AIML_Capstone_Project/df_preprocess_10122024.csv', index=False)
df_preprocess.to_csv('/content/drive/MyDrive/AIML_Capstone_Project/df_preprocess_14122024.csv', index=False)

Step 5: Design train and test basic machine learning classifiers

Before starting to Build Model classifiers,completing the Feature Engineering

In [ ]:
df_preprocess1 = df_preprocess.copy()
In [ ]:
df_preprocess2 = df_preprocess.copy()

Generating Word Embeddings over 'Description' column using Glove, TFI-DF and Word2Vec

In [ ]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

def generate_embedding_dataframes(df):
    df1 = df.copy()
    df2 = df.copy()
    df3 = df.copy()

    # 1. GloVe Embeddings
    def load_glove_model(glove_file):
        embedding_dict = {}
        with open(glove_file, 'r', encoding="utf8") as f:
            for line in f:
                values = line.split()
                word = values[0]
                vector = np.asarray(values[1:], "float32")
                embedding_dict[word] = vector
        return embedding_dict

    def get_average_glove_embeddings(tokenized_words, embedding_dict, embedding_dim=300):
        embeddings = [embedding_dict.get(word, np.zeros(embedding_dim)) for word in tokenized_words]
        return np.mean(embeddings, axis=0) if embeddings else np.zeros(embedding_dim)
 # Load GloVe model and generate GloVe embeddings
    glove_file = '/content/drive/MyDrive/AIML_Capstone_Project/glove.6B/glove.6B.300d.txt'
    glove_embeddings = load_glove_model(glove_file)

    glove_embeddings_series = df1['tokenized_words'].apply(lambda words: get_average_glove_embeddings(words, glove_embeddings))
    Glove_df = pd.concat([df1.drop(columns=['tokenized_words']), pd.DataFrame(glove_embeddings_series.tolist(), columns=[f'GloVe_{i}' for i in range(300)])], axis=1)

    # 2. TF-IDF Features
    tfidf_vectorizer = TfidfVectorizer(tokenizer=lambda x: x, lowercase=False, token_pattern=None)
    tfidf_matrix = tfidf_vectorizer.fit_transform(df2['tokenized_words'])

    # Create a DataFrame with TF-IDF features
    tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
    TFIDF_df = pd.concat([df2.drop(columns=['tokenized_words']), tfidf_df], axis=1)

    # 3. Word2Vec Embeddings
    word2vec_model = Word2Vec(sentences=df3['tokenized_words'], vector_size=300, window=5, min_count=1, workers=4)

    def get_average_word2vec_embeddings(tokenized_words, model, embedding_dim=300):
        embeddings = [model.wv[word] for word in tokenized_words if word in model.wv]
        return np.mean(embeddings, axis=0) if embeddings else np.zeros(embedding_dim)

    word2vec_embeddings_series = df3['tokenized_words'].apply(lambda words: get_average_word2vec_embeddings(words, word2vec_model))
    Word2Vec_df = pd.concat([df3.drop(columns=['tokenized_words']), pd.DataFrame(word2vec_embeddings_series.tolist(), columns=[f'Word2Vec_{i}' for i in range(300)])], axis=1)

    return Glove_df, TFIDF_df, Word2Vec_df

# Use the function to generate the DataFrames
Glove_df, TFIDF_df, Word2Vec_df = generate_embedding_dataframes(df_preprocess1)
In [ ]:
Glove_df
Out[ ]:
Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Day Weekday ... GloVe_290 GloVe_291 GloVe_292 GloVe_293 GloVe_294 GloVe_295 GloVe_296 GloVe_297 GloVe_298 GloVe_299
0 Country_01 Local_01 Mining 1 4 Male Contractor Pressed 1 Friday ... -0.027645 -0.119045 -0.061173 -0.065187 0.026949 0.197509 -0.013762 -0.348437 -0.066048 0.009923
1 Country_02 Local_02 Mining 1 4 Male Employee Pressurized Systems 2 Saturday ... -0.432424 -0.117516 0.034178 0.038456 0.132852 -0.166636 0.068733 -0.216856 -0.043625 -0.046566
2 Country_01 Local_03 Mining 1 3 Male Contractor (Remote) Manual Tools 6 Wednesday ... -0.006795 -0.161874 0.020432 0.085459 0.095127 0.220992 0.045661 -0.145386 0.004915 -0.032415
3 Country_01 Local_04 Mining 1 1 Male Contractor Others 8 Friday ... -0.048605 -0.088765 0.090351 -0.046184 -0.033896 0.236031 -0.110033 -0.125069 -0.052548 -0.041803
4 Country_01 Local_04 Mining 4 4 Male Contractor Others 10 Sunday ... 0.111791 -0.073450 0.056802 -0.105797 0.130160 0.158870 -0.042821 -0.077945 -0.038460 -0.072341
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
413 Country_01 Local_04 Mining 1 3 Male Contractor Others 4 Tuesday ... 0.028515 -0.027942 -0.084710 -0.077906 0.143589 0.281201 -0.145845 -0.103791 0.128524 -0.140132
414 Country_01 Local_03 Mining 1 2 Female Employee Others 4 Tuesday ... 0.042896 -0.137367 0.061687 0.069979 0.087773 0.194813 -0.065351 -0.239557 0.018276 -0.023313
415 Country_02 Local_09 Metals 1 2 Male Employee Venomous Animals 5 Wednesday ... 0.105456 -0.072907 -0.117373 0.090857 0.142089 0.118909 -0.001446 0.063939 -0.069832 -0.082433
416 Country_02 Local_05 Metals 1 2 Male Employee Cut 6 Thursday ... -0.113244 -0.122123 0.062463 0.132644 0.055348 0.084847 0.011991 -0.117702 0.073389 -0.212512
417 Country_01 Local_04 Mining 1 2 Female Contractor Fall prevention (same level) 9 Sunday ... -0.040730 0.015842 -0.097046 0.006672 0.197474 0.048899 0.020562 -0.270391 -0.051318 -0.059785

418 rows × 314 columns

In [ ]:
TFIDF_df
Out[ ]:
Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Day Weekday ... yield yolk young zaf zamac zero zinc zinco zn zone
0 Country_01 Local_01 Mining 1 4 Male Contractor Pressed 1 Friday ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0
1 Country_02 Local_02 Mining 1 4 Male Employee Pressurized Systems 2 Saturday ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0
2 Country_01 Local_03 Mining 1 3 Male Contractor (Remote) Manual Tools 6 Wednesday ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0
3 Country_01 Local_04 Mining 1 1 Male Contractor Others 8 Friday ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0
4 Country_01 Local_04 Mining 4 4 Male Contractor Others 10 Sunday ... 0.0 0.0 0.0 0.209125 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
413 Country_01 Local_04 Mining 1 3 Male Contractor Others 4 Tuesday ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0
414 Country_01 Local_03 Mining 1 2 Female Employee Others 4 Tuesday ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0
415 Country_02 Local_09 Metals 1 2 Male Employee Venomous Animals 5 Wednesday ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0
416 Country_02 Local_05 Metals 1 2 Male Employee Cut 6 Thursday ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0
417 Country_01 Local_04 Mining 1 2 Female Contractor Fall prevention (same level) 9 Sunday ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0

418 rows × 2372 columns

In [ ]:
Word2Vec_df
Out[ ]:
Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Day Weekday ... Word2Vec_290 Word2Vec_291 Word2Vec_292 Word2Vec_293 Word2Vec_294 Word2Vec_295 Word2Vec_296 Word2Vec_297 Word2Vec_298 Word2Vec_299
0 Country_01 Local_01 Mining 1 4 Male Contractor Pressed 1 Friday ... 0.002379 0.015691 0.011600 0.001926 0.016089 0.015971 -0.000278 -0.012707 0.009473 -0.001360
1 Country_02 Local_02 Mining 1 4 Male Employee Pressurized Systems 2 Saturday ... 0.001062 0.005288 0.004659 0.000580 0.005845 0.006274 0.000318 -0.004185 0.003862 -0.001172
2 Country_01 Local_03 Mining 1 3 Male Contractor (Remote) Manual Tools 6 Wednesday ... 0.002426 0.015521 0.012403 0.001232 0.016147 0.016360 0.001063 -0.012123 0.009406 -0.002111
3 Country_01 Local_04 Mining 1 1 Male Contractor Others 8 Friday ... 0.001808 0.014007 0.010629 0.000948 0.013540 0.013591 0.000679 -0.011329 0.009131 -0.001737
4 Country_01 Local_04 Mining 4 4 Male Contractor Others 10 Sunday ... 0.001734 0.013645 0.010474 0.001372 0.013937 0.014240 0.001025 -0.010936 0.008495 -0.001456
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
413 Country_01 Local_04 Mining 1 3 Male Contractor Others 4 Tuesday ... 0.002582 0.012164 0.009392 0.000438 0.013222 0.012999 0.000988 -0.009980 0.008245 -0.001985
414 Country_01 Local_03 Mining 1 2 Female Employee Others 4 Tuesday ... 0.001651 0.014035 0.011934 0.001269 0.015256 0.014991 0.000623 -0.010600 0.008709 -0.002694
415 Country_02 Local_09 Metals 1 2 Male Employee Venomous Animals 5 Wednesday ... 0.002174 0.013794 0.011212 0.002034 0.014942 0.015121 0.001151 -0.010728 0.008882 -0.002454
416 Country_02 Local_05 Metals 1 2 Male Employee Cut 6 Thursday ... 0.003302 0.020869 0.016157 0.001890 0.021896 0.022360 0.001045 -0.015450 0.013169 -0.002337
417 Country_01 Local_04 Mining 1 2 Female Contractor Fall prevention (same level) 9 Sunday ... 0.001515 0.011823 0.009894 0.001099 0.012101 0.013122 0.000772 -0.010205 0.007510 -0.001452

418 rows × 314 columns

In [ ]:
df_preprocess1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Country                   418 non-null    object
 1   City                      418 non-null    object
 2   Industry Sector           418 non-null    object
 3   Accident Level            418 non-null    int64 
 4   Potential Accident Level  418 non-null    int64 
 5   Gender                    418 non-null    object
 6   Employee type             418 non-null    object
 7   Critical Risk             418 non-null    object
 8   Day                       418 non-null    int64 
 9   Weekday                   418 non-null    object
 10  WeekofYear                418 non-null    int64 
 11  Weekend                   418 non-null    int64 
 12  Season                    418 non-null    object
 13  Description               418 non-null    object
 14  tokenized_words           418 non-null    object
dtypes: int64(5), object(10)
memory usage: 49.1+ KB
In [ ]:
# Print shapes to confirm
print(Glove_df.shape)
print(TFIDF_df.shape)
print(Word2Vec_df.shape)
(418, 314)
(418, 2372)
(418, 314)

Check for columns with various datatypes in Glove_df, TFIDF_df & Word2Vec_df¶

In [ ]:
for dtype in Glove_df.dtypes.unique():
  print(f"Columns of type {dtype}:")
  print(Glove_df.select_dtypes(include=[dtype]).columns.tolist())
  print()
Columns of type object:
['Country', 'City', 'Industry Sector', 'Gender', 'Employee type', 'Critical Risk', 'Weekday', 'Season', 'Description']

Columns of type int64:
['Accident Level', 'Potential Accident Level', 'Day', 'WeekofYear', 'Weekend']

Columns of type float64:
['GloVe_0', 'GloVe_1', 'GloVe_2', 'GloVe_3', 'GloVe_4', 'GloVe_5', 'GloVe_6', 'GloVe_7', 'GloVe_8', 'GloVe_9', 'GloVe_10', 'GloVe_11', 'GloVe_12', 'GloVe_13', 'GloVe_14', 'GloVe_15', 'GloVe_16', 'GloVe_17', 'GloVe_18', 'GloVe_19', 'GloVe_20', 'GloVe_21', 'GloVe_22', 'GloVe_23', 'GloVe_24', 'GloVe_25', 'GloVe_26', 'GloVe_27', 'GloVe_28', 'GloVe_29', 'GloVe_30', 'GloVe_31', 'GloVe_32', 'GloVe_33', 'GloVe_34', 'GloVe_35', 'GloVe_36', 'GloVe_37', 'GloVe_38', 'GloVe_39', 'GloVe_40', 'GloVe_41', 'GloVe_42', 'GloVe_43', 'GloVe_44', 'GloVe_45', 'GloVe_46', 'GloVe_47', 'GloVe_48', 'GloVe_49', 'GloVe_50', 'GloVe_51', 'GloVe_52', 'GloVe_53', 'GloVe_54', 'GloVe_55', 'GloVe_56', 'GloVe_57', 'GloVe_58', 'GloVe_59', 'GloVe_60', 'GloVe_61', 'GloVe_62', 'GloVe_63', 'GloVe_64', 'GloVe_65', 'GloVe_66', 'GloVe_67', 'GloVe_68', 'GloVe_69', 'GloVe_70', 'GloVe_71', 'GloVe_72', 'GloVe_73', 'GloVe_74', 'GloVe_75', 'GloVe_76', 'GloVe_77', 'GloVe_78', 'GloVe_79', 'GloVe_80', 'GloVe_81', 'GloVe_82', 'GloVe_83', 'GloVe_84', 'GloVe_85', 'GloVe_86', 'GloVe_87', 'GloVe_88', 'GloVe_89', 'GloVe_90', 'GloVe_91', 'GloVe_92', 'GloVe_93', 'GloVe_94', 'GloVe_95', 'GloVe_96', 'GloVe_97', 'GloVe_98', 'GloVe_99', 'GloVe_100', 'GloVe_101', 'GloVe_102', 'GloVe_103', 'GloVe_104', 'GloVe_105', 'GloVe_106', 'GloVe_107', 'GloVe_108', 'GloVe_109', 'GloVe_110', 'GloVe_111', 'GloVe_112', 'GloVe_113', 'GloVe_114', 'GloVe_115', 'GloVe_116', 'GloVe_117', 'GloVe_118', 'GloVe_119', 'GloVe_120', 'GloVe_121', 'GloVe_122', 'GloVe_123', 'GloVe_124', 'GloVe_125', 'GloVe_126', 'GloVe_127', 'GloVe_128', 'GloVe_129', 'GloVe_130', 'GloVe_131', 'GloVe_132', 'GloVe_133', 'GloVe_134', 'GloVe_135', 'GloVe_136', 'GloVe_137', 'GloVe_138', 'GloVe_139', 'GloVe_140', 'GloVe_141', 'GloVe_142', 'GloVe_143', 'GloVe_144', 'GloVe_145', 'GloVe_146', 'GloVe_147', 'GloVe_148', 'GloVe_149', 'GloVe_150', 'GloVe_151', 'GloVe_152', 'GloVe_153', 'GloVe_154', 'GloVe_155', 'GloVe_156', 'GloVe_157', 'GloVe_158', 'GloVe_159', 'GloVe_160', 'GloVe_161', 'GloVe_162', 'GloVe_163', 'GloVe_164', 'GloVe_165', 'GloVe_166', 'GloVe_167', 'GloVe_168', 'GloVe_169', 'GloVe_170', 'GloVe_171', 'GloVe_172', 'GloVe_173', 'GloVe_174', 'GloVe_175', 'GloVe_176', 'GloVe_177', 'GloVe_178', 'GloVe_179', 'GloVe_180', 'GloVe_181', 'GloVe_182', 'GloVe_183', 'GloVe_184', 'GloVe_185', 'GloVe_186', 'GloVe_187', 'GloVe_188', 'GloVe_189', 'GloVe_190', 'GloVe_191', 'GloVe_192', 'GloVe_193', 'GloVe_194', 'GloVe_195', 'GloVe_196', 'GloVe_197', 'GloVe_198', 'GloVe_199', 'GloVe_200', 'GloVe_201', 'GloVe_202', 'GloVe_203', 'GloVe_204', 'GloVe_205', 'GloVe_206', 'GloVe_207', 'GloVe_208', 'GloVe_209', 'GloVe_210', 'GloVe_211', 'GloVe_212', 'GloVe_213', 'GloVe_214', 'GloVe_215', 'GloVe_216', 'GloVe_217', 'GloVe_218', 'GloVe_219', 'GloVe_220', 'GloVe_221', 'GloVe_222', 'GloVe_223', 'GloVe_224', 'GloVe_225', 'GloVe_226', 'GloVe_227', 'GloVe_228', 'GloVe_229', 'GloVe_230', 'GloVe_231', 'GloVe_232', 'GloVe_233', 'GloVe_234', 'GloVe_235', 'GloVe_236', 'GloVe_237', 'GloVe_238', 'GloVe_239', 'GloVe_240', 'GloVe_241', 'GloVe_242', 'GloVe_243', 'GloVe_244', 'GloVe_245', 'GloVe_246', 'GloVe_247', 'GloVe_248', 'GloVe_249', 'GloVe_250', 'GloVe_251', 'GloVe_252', 'GloVe_253', 'GloVe_254', 'GloVe_255', 'GloVe_256', 'GloVe_257', 'GloVe_258', 'GloVe_259', 'GloVe_260', 'GloVe_261', 'GloVe_262', 'GloVe_263', 'GloVe_264', 'GloVe_265', 'GloVe_266', 'GloVe_267', 'GloVe_268', 'GloVe_269', 'GloVe_270', 'GloVe_271', 'GloVe_272', 'GloVe_273', 'GloVe_274', 'GloVe_275', 'GloVe_276', 'GloVe_277', 'GloVe_278', 'GloVe_279', 'GloVe_280', 'GloVe_281', 'GloVe_282', 'GloVe_283', 'GloVe_284', 'GloVe_285', 'GloVe_286', 'GloVe_287', 'GloVe_288', 'GloVe_289', 'GloVe_290', 'GloVe_291', 'GloVe_292', 'GloVe_293', 'GloVe_294', 'GloVe_295', 'GloVe_296', 'GloVe_297', 'GloVe_298', 'GloVe_299']

In [ ]:
for dtype in TFIDF_df.dtypes.unique():
  print(f"Columns of type {dtype}:")
  print(TFIDF_df.select_dtypes(include=[dtype]).columns.tolist())
  print()
Columns of type object:
['Country', 'City', 'Industry Sector', 'Gender', 'Employee type', 'Critical Risk', 'Weekday', 'Season', 'Description']

Columns of type int64:
['Accident Level', 'Potential Accident Level', 'Day', 'WeekofYear', 'Weekend']

Columns of type float64:
['abb', 'abdoman', 'able', 'abratech', 'abrupt', 'abruptly', 'absorb', 'absorbent', 'abutment', 'acc', 'accelerate', 'access', 'accessory', 'accident', 'accidentally', 'accidently', 'accommodate', 'accompany', 'accord', 'accretion', 'accumulate', 'accumulation', 'achieve', 'acid', 'acl', 'acquisition', 'act', 'action', 'activate', 'activation', 'activity', 'actuate', 'adapt', 'adapter', 'addition', 'additive', 'ademir', 'adhere', 'adhesion', 'adjoining', 'adjust', 'adjustment', 'adjutant', 'administrative', 'advance', 'aerial', 'affect', 'affected', 'aforementioned', 'afternoon', 'aggregate', 'agitated', 'ago', 'ahead', 'aid', 'air', 'airlift', 'ajani', 'ajax', 'albertico', 'albino', 'alcohot', 'alert', 'alex', 'alfredo', 'align', 'alimak', 'alimakero', 'alizado', 'allergic', 'allergy', 'allow', 'alpha', 'aluminum', 'ambulance', 'ambulatory', 'amg', 'ammonia', 'ampoloader', 'amputation', 'analysis', 'ancash', 'anchor', 'anchorage', 'anchoring', 'anfo', 'anfoloader', 'angle', 'ankle', 'anode', 'answer', 'antenna', 'anterior', 'anthony', 'anti', 'antiallergic', 'antnio', 'antonio', 'apparent', 'apparently', 'appear', 'apply', 'approach', 'approx', 'approximate', 'approximately', 'aramid', 'arc', 'area', 'aripuan', 'arm', 'arrange', 'arrive', 'ask', 'assemble', 'assembly', 'assign', 'assist', 'assistant', 'assume', 'atenuz', 'atlas', 'atricion', 'atriction', 'attach', 'attack', 'attempt', 'attend', 'attendant', 'attention', 'attribute', 'attrition', 'autoclave', 'automatic', 'auxiliar', 'auxiliary', 'average', 'avoid', 'away', 'b', 'backhoe', 'backwards', 'bag', 'balance', 'balancing', 'ball', 'balloon', 'band', 'bank', 'bap', 'bar', 'barb', 'barbed', 'barel', 'barretilla', 'base', 'basin', 'basket', 'bathroom', 'baton', 'battery', 'beak', 'beam', 'bear', 'bearing', 'beat', 'becker', 'bee', 'beehive', 'beetle', 'begin', 'believe', 'belly', 'belt', 'bench', 'bend', 'bhb', 'big', 'bin', 'bine', 'bioxide', 'bit', 'bite', 'blackjack', 'bladder', 'blade', 'blanket', 'blast', 'blaster', 'blasting', 'blind', 'block', 'blow', 'blower', 'blunt', 'board', 'bob', 'bodeguero', 'body', 'boiler', 'bolt', 'boltec', 'bolter', 'bomb', 'bonifacio', 'bonnet', 'boom', 'boot', 'bore', 'borehole', 'boss', 'bother', 'bottle', 'bounce', 'bowl', 'box', 'bp', 'bra', 'brace', 'bracket', 'brake', 'braking', 'branch', 'brapdd', 'break', 'breaker', 'breeder', 'breno', 'brick', 'bricklayer', 'bridge', 'brigade', 'bring', 'broken', 'bruise', 'brush', 'brushcutter', 'bucket', 'building', 'bump', 'bundle', 'burn', 'burning', 'burr', 'burst', 'bus', 'bypass', 'c', 'cab', 'cabin', 'cabinet', 'cable', 'cadmium', 'cage', 'cajamarquilla', 'calf', 'calibrator', 'call', 'camera', 'camp', 'canario', 'cane', 'canterio', 'canvas', 'cap', 'car', 'carbon', 'cardan', 'care', 'carlos', 'carman', 'carousel', 'carpenter', 'carpentry', 'carry', 'cart', 'carton', 'casionndole', 'cast', 'casting', 'cat', 'catch', 'catheter', 'cathode', 'cathodic', 'cause', 'caustic', 'cave', 'ce', 'ceiling', 'cell', 'cement', 'center', 'central', 'centralizer', 'cep', 'ceremony', 'certain', 'cervical', 'cesar', 'chagua', 'chain', 'chair', 'chamber', 'change', 'channel', 'chapel', 'charge', 'check', 'cheek', 'cheekbone', 'chemical', 'chemo', 'chest', 'chestnut', 'chicken', 'chicoteo', 'chicrin', 'chief', 'chimney', 'chin', 'chirodactile', 'chirodactilo', 'chiropactyl', 'chisel', 'choco', 'choose', 'chop', 'chopping', 'chuck', 'chuquillanqui', 'chute', 'chuteo', 'cia', 'ciliary', 'cinnamon', 'circuit', 'circumstance', 'cite', 'city', 'civil', 'civilian', 'clamp', 'classification', 'claudio', 'clean', 'cleaning', 'clear', 'clearing', 'clerk', 'click', 'climb', 'clinic', 'clockwise', 'clog', 'clogging', 'close', 'closing', 'cloth', 'clothe', 'cluster', 'cm', 'cma', 'co', 'coat', 'cocada', 'cockpit', 'code', 'coil', 'cold', 'collaborator', 'collar', 'colleague', 'collect', 'collection', 'collide', 'combination', 'come', 'comedor', 'comfort', 'command', 'communicate', 'communication', 'company', 'compartment', 'complain', 'complete', 'compose', 'composition', 'compress', 'compressed', 'compressor', 'concentrate', 'concentrator', 'conchucos', 'conclusion', 'concrete', 'concussion', 'conditioning', 'conduct', 'conductive', 'cone', 'confine', 'confipetrol', 'confirm', 'congestion', 'connect', 'connection', 'connector', 'consequence', 'consequently', 'consist', 'construction', 'consult', 'consultant', 'consultation', 'contact', 'contain', 'container', 'containment', 'contaminate', 'contaminated', 'content', 'continue', 'continuously', 'contracture', 'control', 'contusion', 'conveyor', 'convoy', 'cook', 'cooker', 'cooking', 'cool', 'coordinate', 'coordination', 'copilot', 'copla', 'copper', 'cord', 'cormei', 'corner', 'correct', 'correctly', 'correspond', 'corresponding', 'corridor', 'corrugate', 'cosapi', 'costa', 'couple', 'coupling', 'courier', 'cover', 'crack', 'crane', 'crash', 'create', 'crest', 'crew', 'cristbal', 'cristian', 'cro', 'cross', 'crosscutter', 'crossing', 'crouch', 'crown', 'crucible', 'cruise', 'cruiser', 'crumble', 'crush', 'crusher', 'crushing', 'cruz', 'csar', 'cubic', 'cue', 'culminate', 'curl', 'current', 'curve', 'cut', 'cutter', 'cutting', 'cx', 'cycle', 'cyclone', 'cylinder', 'cylindrical', 'da', 'dado', 'damage', 'daniel', 'danillo', 'danon', 'datum', 'day', 'dayme', 'ddh', 'dds', 'de', 'death', 'debarking', 'debris', 'deceased', 'december', 'decide', 'deconcentrate', 'decrease', 'deep', 'deepening', 'defective', 'defensive', 'define', 'degree', 'delivery', 'demag', 'demineralization', 'demister', 'denis', 'depressurisation', 'depth', 'derail', 'derive', 'descend', 'describe', 'design', 'designate', 'deslaminadora', 'deslaminator', 'despite', 'detach', 'detachment', 'detect', 'detector', 'deteriorate', 'detonate', 'detritus', 'develop', 'deviate', 'device', 'diagnose', 'diagnosis', 'diagonal', 'diagonally', 'diamantina', 'diameter', 'diamond', 'diassis', 'die', 'diesel', 'difficult', 'digger', 'dimension', 'dining', 'dioxide', 'direct', 'direction', 'directly', 'disabled', 'disassemble', 'disassembly', 'discharge', 'discomfort', 'disconnect', 'disconnection', 'discover', 'disengage', 'dish', 'disintegrate', 'disk', 'dismantle', 'dismantling', 'displace', 'displacement', 'disposal', 'disrupt', 'distal', 'distance', 'distant', 'distract', 'distribution', 'distributor', 'ditch', 'diversion', 'divert', 'divine', 'divino', 'dizziness', 'do', 'doctor', 'door', 'doosan', 'dosage', 'doser', 'downward', 'drag', 'drain', 'drainage', 'draw', 'drawer', 'drill', 'driller', 'drilling', 'drive', 'driver', 'drop', 'drum', 'dry', 'duct', 'dump', 'dumper', 'dune', 'dust', 'duty', 'duval', 'e', 'ear', 'earth', 'earthenware', 'easel', 'east', 'edge', 'eduardo', 'ee', 'effect', 'effective', 'effort', 'efran', 'eissa', 'ejecting', 'eka', 'el', 'elbow', 'electric', 'electrical', 'electrician', 'electro', 'electrolysis', 'electrolyte', 'electrometallurgy', 'electrowelded', 'element', 'elevation', 'eliseo', 'elismar', 'ematoma', 'embed', 'emergency', 'emerson', 'employee', 'empresa', 'emptiness', 'emptying', 'emulsion', 'enabled', 'encounter', 'end', 'endure', 'energize', 'energized', 'energy', 'enforce', 'engage', 'engine', 'engineer', 'enmicadas', 'enoc', 'ensure', 'enter', 'entire', 'entrance', 'entry', 'environment', 'environmental', 'epis', 'epp', 'epps', 'equally', 'equipment', 'erasing', 'eric', 'eriks', 'escape', 'esengrasante', 'estimate', 'estriping', 'eusbio', 'eustaquio', 'evacuate', 'evacuation', 'evaluate', 'evaluation', 'evaporator', 'event', 'ex', 'examination', 'excavate', 'excavation', 'excavator', 'excess', 'excessive', 'exchange', 'exchanger', 'excited', 'excoriation', 'execution', 'exert', 'existence', 'exit', 'expansion', 'expedition', 'expel', 'explode', 'explomin', 'explosion', 'explosive', 'expose', 'extension', 'external', 'extra', 'extract', 'extraction', 'extruder', 'eye', 'eyebolt', 'eyebrow', 'eyelash', 'eyelet', 'eyelid', 'eyewash', 'f', 'fabio', 'fabric', 'face', 'facial', 'facila', 'facilitate', 'facility', 'fact', 'factory', 'fail', 'failure', 'faintness', 'fall', 'falling', 'false', 'fan', 'fanel', 'fanele', 'farm', 'fasten', 'faucet', 'favor', 'fbio', 'feast', 'fectuaban', 'feed', 'feeder', 'feel', 'feeling', 'felipe', 'felix', 'fence', 'fender', 'fernando', 'fernndez', 'ferranta', 'fiberglass', 'field', 'fifth', 'figure', 'fill', 'filling', 'filter', 'filtration', 'final', 'finally', 'find', 'finding', 'fine', 'finger', 'finish', 'fire', 'firmly', 'fish', 'fisherman', 'fissure', 'fit', 'fix', 'fixed', 'fixing', 'flammable', 'flange', 'flash', 'flat', 'flex', 'flexible', 'floor', 'flotation', 'flow', 'flyght', 'foam', 'fogging', 'folder', 'foliage', 'follow', 'food', 'foot', 'footwear', 'fop', 'force', 'forearm', 'forehead', 'foreman', 'forest', 'forklift', 'form', 'formation', 'forward', 'foundry', 'fourth', 'fracture', 'fragment', 'fragmento', 'frame', 'francisco', 'frank', 'freddy', 'free', 'friction', 'fright', 'frightened', 'frontal', 'frontally', 'fruit', 'fuel', 'fulcrum', 'fully', 'functioning', 'funnel', 'furnace', 'fuse', 'future', 'g', 'gable', 'gallery', 'gallon', 'gap', 'garit', 'garrote', 'gas', 'gate', 'gauge', 'gaze', 'gear', 'gearbox', 'geho', 'general', 'generate', 'geological', 'geologist', 'geologo', 'geology', 'geomembrane', 'georli', 'geosol', 'get', 'getting', 'gift', 'gilton', 'gilvnio', 'girdle', 'give', 'glass', 'glove', 'go', 'goat', 'goggle', 'good', 'gps', 'gr', 'grab', 'gram', 'granja', 'grate', 'grating', 'gravel', 'graze', 'great', 'grid', 'griff', 'grille', 'grind', 'grinder', 'ground', 'group', 'grp', 'grs', 'gts', 'guard', 'guide', 'guillotine', 'gun', 'gutter', 'h', 'habilitation', 'half', 'hammer', 'hand', 'handle', 'handrail', 'hang', 'happen', 'harden', 'harness', 'hastial', 'hat', 'hatch', 'haul', 'have', 'having', 'hdp', 'hdpe', 'head', 'headlight', 'health', 'hear', 'heat', 'heated', 'heavy', 'heel', 'height', 'helical', 'helmet', 'help', 'helper', 'hematoma', 'hemiface', 'hexagonal', 'hiab', 'hidalgo', 'high', 'highway', 'hill', 'hinge', 'hip', 'hiss', 'hit', 'hitchhike', 'hoe', 'hoist', 'hoisting', 'hold', 'holder', 'holding', 'hole', 'hood', 'hook', 'hopper', 'horizontal', 'horizontally', 'horse', 'hose', 'hospital', 'hot', 'hour', 'house', 'housing', 'hq', 'hrs', 'humped', 'hurry', 'hw', 'hycron', 'hydraulic', 'hydrojet', 'hydroxide', 'hyt', 'ice', 'identify', 'iglu', 'ignite', 'igor', 'ii', 'iii', 'illness', 'imbalance', 'immediate', 'immediately', 'impact', 'impacting', 'importance', 'impregnate', 'imprison', 'imprisonment', 'impromec', 'improve', 'incentration', 'inch', 'inchancable', 'inchancanble', 'incident', 'incimet', 'incimmet', 'inclination', 'inclined', 'include', 'increase', 'index', 'indicate', 'industrial', 'inefficacy', 'inertia', 'inferior', 'inform', 'infrastructure', 'ingot', 'initial', 'initiate', 'injection', 'injure', 'injured', 'injury', 'inlet', 'inner', 'insect', 'insertion', 'inside', 'inspect', 'inspection', 'instal', 'install', 'installation', 'instant', 'instep', 'instruct', 'insulation', 'intense', 'intention', 'interior', 'interlace', 'interlaced', 'intermediate', 'internal', 'intersection', 'inthinc', 'introduce', 'invade', 'investigation', 'involuntarily', 'involve', 'involved', 'inward', 'ip', 'iron', 'ironing', 'irritation', 'isc', 'isidro', 'isolate', 'ith', 'iv', 'jaba', 'jack', 'jacket', 'jackleg', 'jaw', 'jehovah', 'jehovnio', 'jesus', 'jet', 'jetanol', 'jhon', 'jhonatan', 'jhony', 'jib', 'jka', 'job', 'joint', 'jos', 'jose', 'josimar', 'juan', 'julio', 'july', 'jumbo', 'jump', 'juna', 'junior', 'juveni', 'keep', 'kelly', 'kevin', 'key', 'keypad', 'kg', 'kick', 'killer', 'kiln', 'kitchen', 'km', 'knee', 'kneel', 'kneeling', 'knife', 'know', 'knuckle', 'kv', 'la', 'label', 'labeling', 'labor', 'laboratory', 'laceration', 'lack', 'ladder', 'laden', 'lady', 'lajes', 'laminator', 'lamp', 'lance', 'lane', 'laquia', 'large', 'lash', 'late', 'later', 'lateral', 'laterally', 'launch', 'launcher', 'laundry', 'lavra', 'lay', 'leach', 'leaching', 'lead', 'leader', 'leak', 'leakage', 'lean', 'leandro', 'leather', 'leave', 'lectro', 'left', 'leg', 'legging', 'lemon', 'length', 'lens', 'lense', 'lesion', 'leucena', 'level', 'lever', 'lhd', 'liana', 'license', 'lid', 'lie', 'lifeline', 'lift', 'lifting', 'light', 'lighthouse', 'like', 'liliana', 'lima', 'limb', 'lime', 'line', 'lineman', 'lining', 'link', 'lip', 'liquid', 'list', 'lit', 'liter', 'litorina', 'litter', 'little', 'load', 'loaded', 'loader', 'loading', 'local', 'localize', 'localized', 'locate', 'location', 'lock', 'locker', 'locking', 'locomotive', 'lodge', 'long', 'look', 'lookout', 'loose', 'loosen', 'lose', 'loud', 'low', 'lower', 'lt', 'ltda', 'lubricant', 'lubricate', 'lubrication', 'lubricator', 'lucas', 'luciano', 'luis', 'luiz', 'lumbar', 'luna', 'lunch', 'lung', 'luxo', 'lx', 'lying', 'lyner', 'lzaro', 'm', 'macedonio', 'machete', 'machine', 'machinery', 'maestranza', 'mag', 'magazine', 'magnetometer', 'magnetometric', 'maid', 'main', 'maintain', 'maintenance', 'make', 'mallet', 'man', 'manage', 'management', 'manco', 'manetometer', 'maneuver', 'mangote', 'manhole', 'manifest', 'manifestation', 'manipulate', 'manipulation', 'manipulator', 'manitou', 'manoel', 'manual', 'manually', 'manuel', 'maperu', 'mapping', 'marble', 'marcelo', 'marco', 'marcos', 'marcy', 'maribondo', 'marimbondo', 'mario', 'mark', 'marking', 'martinpole', 'mask', 'maslucan', 'mason', 'master', 'mat', 'mata', 'material', 'maximum', 'mceisa', 'mean', 'measure', 'measurement', 'measuring', 'mechanic', 'mechanical', 'mechanized', 'medical', 'medicate', 'medicine', 'melt', 'member', 'mesh', 'messr', 'messrs', 'metal', 'metallic', 'metatarsal', 'meter', 'middle', 'miguel', 'mild', 'mill', 'milling', 'milpo', 'milton', 'mina', 'mince', 'mine', 'mineral', 'mini', 'mining', 'minor', 'minute', 'misalignment', 'miss', 'mix', 'mixed', 'mixer', 'mixkret', 'mixture', 'ml', 'mobile', 'module', 'moinsac', 'mollare', 'mollares', 'moment', 'mona', 'monitor', 'monitoring', 'monkey', 'month', 'moon', 'mooring', 'morais', 'mortar', 'moth', 'motion', 'motor', 'motorist', 'mount', 'mouth', 'move', 'movement', 'mr', 'mrcio', 'mrio', 'mt', 'mud', 'municipal', 'murilo', 'muscle', 'n', 'nail', 'nascimento', 'natclar', 'near', 'nearby', 'necessary', 'neck', 'need', 'needle', 'negative', 'neglect', 'neutral', 'new', 'night', 'nilton', 'nipple', 'nitric', 'noise', 'non', 'normal', 'normally', 'north', 'nose', 'note', 'notebook', 'notice', 'noticing', 'novo', 'nozzle', 'nq', 'nro', 'nut', 'nv', 'nylon', 'ob', 'object', 'observe', 'obstruct', 'obstruction', 'occupant', 'occur', 'office', 'official', 'oil', 'old', 'ompressor', 'one', 'op', 'open', 'opening', 'operate', 'operating', 'operation', 'operational', 'operator', 'opposite', 'orange', 'order', 'ordinary', 'ore', 'originate', 'orlando', 'oscillation', 'osorio', 'outcrop', 'outlet', 'outpatient', 'outside', 'oven', 'overall', 'overcome', 'overexertion', 'overflow', 'overhang', 'overhead', 'overheating', 'overlap', 'overpressure', 'overturn', 'oxicorte', 'oxide', 'oxyfuel', 'pablo', 'pack', 'package', 'pad', 'page', 'pain', 'paint', 'painting', 'palm', 'panel', 'pant', 'paracatu', 'paralysis', 'paralyze', 'park', 'parking', 'part', 'partially', 'participate', 'particle', 'partner', 'pasco', 'pass', 'passage', 'paste', 'pasture', 'path', 'patrol', 'patronal', 'paulo', 'pause', 'pay', 'pb', 'pead', 'pear', 'pedal', 'pedestal', 'pedro', 'peel', 'pen', 'pendulum', 'pentacord', 'penultimate', 'people', 'perceive', 'percussion', 'perforation', 'perform', 'performer', 'period', 'peristaltic', 'person', 'personal', 'personnel', 'phalanx', 'phase', 'photo', 'photograph', 'physician', 'pick', 'pickaxe', 'pickup', 'piece', 'pierce', 'pig', 'pillar', 'pilot', 'pin', 'pink', 'pipe', 'pipeline', 'pipette', 'piping', 'pique', 'piquero', 'piston', 'pit', 'pivot', 'place', 'placement', 'placing', 'planamieto', 'planning', 'plant', 'plastic', 'plate', 'platform', 'play', 'plug', 'pm', 'pneumatic', 'pocket', 'point', 'pointed', 'pole', 'polling', 'polyethylene', 'polymer', 'polyontusion', 'polypropylene', 'polyurethane', 'pom', 'ponchos', 'porangatu', 'portable', 'portion', 'porvenir', 'position', 'positioning', 'positive', 'possible', 'possibly', 'post', 'pot', 'potion', 'pound', 'pour', 'povoado', 'powder', 'power', 'ppe', 'pre', 'preparation', 'prepare', 'prescribing', 'presence', 'present', 'press', 'pressure', 'prevent', 'preventive', 'previous', 'previously', 'prick', 'pril', 'primary', 'probe', 'problem', 'procedure', 'proceed', 'proceeding', 'process', 'produce', 'product', 'production', 'profile', 'progress', 'progressive', 'proingcom', 'project', 'projection', 'promptly', 'prong', 'propeller', 'properly', 'propicindose', 'prospector', 'protection', 'protective', 'protector', 'protrude', 'protruding', 'provoke', 'proximal', 'psi', 'public', 'puddle', 'pull', 'pulley', 'pulp', 'pulpomatic', 'pump', 'pumping', 'purification', 'push', 'put', 'putty', 'pvc', 'pyrotechnic', 'queneche', 'quickly', 'quinoa', 'quirodactilo', 'quirodactyl', 'rack', 'radial', 'radiator', 'radio', 'radius', 'rafael', 'rag', 'rail', 'railing', 'railway', 'raise', 'rake', 'ramp', 'rampa', 'rapid', 'raspndose', 'raul', 'ravine', 'ray', 'rb', 'reach', 'react', 'reaction', 'reactive', 'readjust', 'realize', 'rear', 'reason', 'rebound', 'receive', 'recently', 'reception', 'reciprocate', 'reconnaissance', 'recovery', 'redness', 'reduce', 'reducer', 'reduction', 'reel', 'refer', 'reference', 'reflux', 'refractory', 'refrigerant', 'refuge', 'refurbishment', 'region', 'register', 'reinforce', 'reinstallation', 'release', 'remain', 'remedy', 'removal', 'remove', 'renato', 'repair', 'replace', 'report', 'reposition', 'represent', 'repulpe', 'request', 'require', 'resane', 'rescue', 'research', 'reserve', 'reshape', 'residence', 'resident', 'residual', 'residue', 'resin', 'resistance', 'respective', 'respirator', 'respond', 'response', 'responsible', 'rest', 'restart', 'restrict', 'result', 'retire', 'retract', 'retraction', 'retreat', 'return', 'revegetation', 'reverse', 'review', 'rhainer', 'rhyming', 'ribbon', 'rice', 'ride', 'rig', 'rigger', 'right', 'rim', 'ring', 'rip', 'ripp', 'ripper', 'rise', 'risk', 'rivet', 'rlc', 'road', 'robot', 'robson', 'rock', 'rocker', 'rod', 'roger', 'rolando', 'roll', 'roller', 'rollover', 'romn', 'ronald', 'roof', 'room', 'rop', 'rope', 'rotary', 'rotate', 'rotation', 'rotor', 'routine', 'row', 'roy', 'rp', 'rpa', 'rub', 'rubber', 'rugged', 'run', 'rung', 'rupture', 'rush', 'sacrifice', 'sacrificial', 'saddle', 'safe', 'safety', 'said', 'sailor', 'sample', 'sampler', 'samuel', 'sand', 'sanitation', 'santa', 'santo', 'sardinel', 'saturate', 'saw', 'say', 'scaffold', 'scaffolding', 'scaler', 'scaller', 'scalp', 'scare', 'sccop', 'schedule', 'scissor', 'scoop', 'scooptram', 'scoria', 'scorpion', 'scrap', 'scraper', 'screen', 'screw', 'screwdriver', 'scruber', 'seal', 'sealing', 'seam', 'seat', 'seatbelt', 'second', 'secondary', 'section', 'sectioned', 'secure', 'security', 'sediment', 'sedimentation', 'see', 'seek', 'segment', 'semi', 'sensation', 'sensor', 'september', 'serra', 'servant', 'serve', 'service', 'servitecforaco', 'set', 'setting', 'settle', 'seven', 'sf', 'shaft', 'shake', 'shallow', 'shank', 'shape', 'share', 'sharply', 'shear', 'sheepskin', 'sheet', 'shell', 'shield', 'shift', 'shine', 'shipment', 'shipper', 'shipping', 'shirt', 'shock', 'shocrete', 'shoe', 'shoot', 'short', 'shorten', 'shot', 'shotcrete', 'shotcreteado', 'shotcreterepentinamente', 'shoulder', 'shovel', 'show', 'shower', 'shutter', 'shuttering', 'sickle', 'side', 'siemag', 'signal', 'signaling', 'silicate', 'silo', 'silva', 'silver', 'simba', 'simultaneously', 'sink', 'sip', 'sit', 'site', 'situation', 'size', 'sketched', 'skid', 'skimmer', 'skin', 'skip', 'slab', 'slag', 'slaughter', 'sledgehammer', 'sleeper', 'sleeve', 'slide', 'sliding', 'slight', 'slightly', 'slimme', 'sling', 'slip', 'slippery', 'slope', 'slow', 'sludge', 'small', 'snack', 'snake', 'so', 'socket', 'socorro', 'soda', 'sodium', 'soft', 'soil', 'soldering', 'sole', 'solid', 'solubilization', 'solution', 'soon', 'soquet', 'sound', 'south', 'space', 'span', 'spare', 'spark', 'spatter', 'spatula', 'spear', 'speart', 'specific', 'specify', 'spend', 'spike', 'spill', 'spillway', 'spin', 'spine', 'splash', 'splinter', 'split', 'spoiler', 'spool', 'spoon', 'sprain', 'spume', 'square', 'squat', 'squatting', 'srgio', 'ssomac', 'st', 'sta', 'stability', 'stabilize', 'stabilizer', 'stack', 'stacker', 'stacking', 'staff', 'stage', 'stair', 'staircase', 'stake', 'stand', 'standardization', 'start', 'starter', 'state', 'station', 'steam', 'steel', 'steep', 'steering', 'stem', 'step', 'stepladder', 'stick', 'stilson', 'sting', 'stir', 'stirrup', 'stitch', 'stone', 'stool', 'stoop', 'stop', 'stope', 'stoppage', 'stopper', 'storage', 'store', 'storm', 'stp', 'straight', 'strain', 'strap', 'street', 'strength', 'stretch', 'stretcher', 'strike', 'strip', 'stroke', 'strong', 'structure', 'strut', 'stumble', 'stump', 'stun', 'stylet', 'sub', 'subjection', 'submerge', 'subsequent', 'subsequently', 'success', 'suction', 'sudden', 'suddenly', 'suffer', 'suffering', 'suitably', 'sul', 'sulfate', 'sulfide', 'sulfur', 'sulfuric', 'sulphate', 'sulphide', 'sump', 'sunday', 'sunglass', 'superciliary', 'superficial', 'superficially', 'superior', 'supervise', 'supervision', 'supervisor', 'supervisory', 'supply', 'support', 'surcharge', 'sure', 'surface', 'surprise', 'surround', 'survey', 'surveying', 'suspend', 'suspender', 'sustain', 'sustained', 'suture', 'swarm', 'swathe', 'sweep', 'swell', 'swelling', 'swing', 'switch', 'symptom', 'system', 't', 'table', 'tabola', 'tabolas', 'tail', 'tailing', 'tajo', 'take', 'talus', 'tangle', 'tank', 'tanker', 'tap', 'tape', 'taque', 'target', 'task', 'taut', 'teacher', 'team', 'teammate', 'tear', 'tearing', 'technical', 'technician', 'tecl', 'tecla', 'tecle', 'tecnomin', 'telescopic', 'tell', 'tello', 'temporarily', 'temporary', 'tension', 'tenth', 'test', 'testimony', 'tether', 'thermal', 'thermomagnetic', 'thickener', 'thickness', 'thigh', 'thin', 'thinner', 'thorax', 'thorn', 'thread', 'throw', 'throwing', 'thrust', 'thug', 'thumb', 'thunderous', 'tick', 'tie', 'tighten', 'tilt', 'time', 'timely', 'tip', 'tipper', 'tire', 'tirfor', 'tirford', 'tito', 'tj', 'tk', 'tm', 'tn', 'toe', 'toecap', 'toilet', 'ton', 'tool', 'topographic', 'torch', 'torque', 'torre', 'total', 'touch', 'tour', 'tower', 'toxicity', 'toy', 'tq', 'tqs', 'track', 'tractor', 'trailer', 'trainee', 'tranfer', 'tranquera', 'transfe', 'transfer', 'transformer', 'transit', 'transmission', 'transport', 'transverse', 'transversely', 'trap', 'trauma', 'traumatic', 'traumatism', 'travel', 'traverse', 'tray', 'tread', 'treat', 'treatment', 'tree', 'trellex', 'trench', 'trestle', 'triangular', 'trip', 'truck', 'try', 'tube', 'tubing', 'tubo', 'tucum', 'tunel', 'tunnel', 'turn', 'turntable', 'twice', 'twist', 'twisting', 'tying', 'type', 'tyrfor', 'unbalance', 'unbalanced', 'unclog', 'uncoupled', 'uncover', 'undergo', 'underground', 'uneven', 'unevenness', 'unexpectedly', 'unhook', 'unicon', 'uniform', 'union', 'unit', 'unleashing', 'unload', 'unloading', 'unlock', 'unlocking', 'unscrew', 'unstable', 'untie', 'untimely', 'upper', 'upward', 'upwards', 'use', 'ustulacin', 'ustulado', 'ustulador', 'ustulation', 'usual', 'utensil', 'vacuum', 'valve', 'van', 'vanish', 'vazante', 'vegetation', 'vehicle', 'ventilation', 'verification', 'verifie', 'verify', 'vertical', 'vertically', 'vial', 'victalica', 'victim', 'victor', 'vieira', 'vine', 'violent', 'violently', 'virdro', 'visibility', 'vision', 'visit', 'vista', 'visual', 'visualize', 'vitaulic', 'vms', 'void', 'voltage', 'volumetric', 'volvo', 'vsd', 'waelz', 'wagon', 'wait', 'walk', 'wall', 'walrus', 'walter', 'want', 'warehouse', 'warley', 'warman', 'warning', 'warp', 'warrin', 'wash', 'washing', 'wasp', 'waste', 'watch', 'water', 'watered', 'watermelon', 'wax', 'way', 'wca', 'weakly', 'wear', 'wedge', 'weed', 'weevil', 'weigh', 'weight', 'weld', 'welder', 'welding', 'wellfield', 'west', 'wet', 'wheel', 'wheelbarrow', 'whiplash', 'whistle', 'wick', 'wide', 'width', 'wila', 'wilber', 'wilder', 'william', 'willing', 'wilmer', 'winch', 'winche', 'window', 'winemaker', 'winery', 'wire', 'withdraw', 'withdrawal', 'woman', 'wood', 'wooden', 'work', 'worker', 'workplace', 'workshop', 'wound', 'wrench', 'wrist', 'x', 'xixs', 'xrd', 'xxx', 'yaranga', 'yard', 'ydr', 'yield', 'yolk', 'young', 'zaf', 'zamac', 'zero', 'zinc', 'zinco', 'zn', 'zone']

In [ ]:
for dtype in Word2Vec_df.dtypes.unique():
  print(f"Columns of type {dtype}:")
  print(Word2Vec_df.select_dtypes(include=[dtype]).columns.tolist())
  print()
Columns of type object:
['Country', 'City', 'Industry Sector', 'Gender', 'Employee type', 'Critical Risk', 'Weekday', 'Season', 'Description']

Columns of type int64:
['Accident Level', 'Potential Accident Level', 'Day', 'WeekofYear', 'Weekend']

Columns of type float32:
['Word2Vec_0', 'Word2Vec_1', 'Word2Vec_2', 'Word2Vec_3', 'Word2Vec_4', 'Word2Vec_5', 'Word2Vec_6', 'Word2Vec_7', 'Word2Vec_8', 'Word2Vec_9', 'Word2Vec_10', 'Word2Vec_11', 'Word2Vec_12', 'Word2Vec_13', 'Word2Vec_14', 'Word2Vec_15', 'Word2Vec_16', 'Word2Vec_17', 'Word2Vec_18', 'Word2Vec_19', 'Word2Vec_20', 'Word2Vec_21', 'Word2Vec_22', 'Word2Vec_23', 'Word2Vec_24', 'Word2Vec_25', 'Word2Vec_26', 'Word2Vec_27', 'Word2Vec_28', 'Word2Vec_29', 'Word2Vec_30', 'Word2Vec_31', 'Word2Vec_32', 'Word2Vec_33', 'Word2Vec_34', 'Word2Vec_35', 'Word2Vec_36', 'Word2Vec_37', 'Word2Vec_38', 'Word2Vec_39', 'Word2Vec_40', 'Word2Vec_41', 'Word2Vec_42', 'Word2Vec_43', 'Word2Vec_44', 'Word2Vec_45', 'Word2Vec_46', 'Word2Vec_47', 'Word2Vec_48', 'Word2Vec_49', 'Word2Vec_50', 'Word2Vec_51', 'Word2Vec_52', 'Word2Vec_53', 'Word2Vec_54', 'Word2Vec_55', 'Word2Vec_56', 'Word2Vec_57', 'Word2Vec_58', 'Word2Vec_59', 'Word2Vec_60', 'Word2Vec_61', 'Word2Vec_62', 'Word2Vec_63', 'Word2Vec_64', 'Word2Vec_65', 'Word2Vec_66', 'Word2Vec_67', 'Word2Vec_68', 'Word2Vec_69', 'Word2Vec_70', 'Word2Vec_71', 'Word2Vec_72', 'Word2Vec_73', 'Word2Vec_74', 'Word2Vec_75', 'Word2Vec_76', 'Word2Vec_77', 'Word2Vec_78', 'Word2Vec_79', 'Word2Vec_80', 'Word2Vec_81', 'Word2Vec_82', 'Word2Vec_83', 'Word2Vec_84', 'Word2Vec_85', 'Word2Vec_86', 'Word2Vec_87', 'Word2Vec_88', 'Word2Vec_89', 'Word2Vec_90', 'Word2Vec_91', 'Word2Vec_92', 'Word2Vec_93', 'Word2Vec_94', 'Word2Vec_95', 'Word2Vec_96', 'Word2Vec_97', 'Word2Vec_98', 'Word2Vec_99', 'Word2Vec_100', 'Word2Vec_101', 'Word2Vec_102', 'Word2Vec_103', 'Word2Vec_104', 'Word2Vec_105', 'Word2Vec_106', 'Word2Vec_107', 'Word2Vec_108', 'Word2Vec_109', 'Word2Vec_110', 'Word2Vec_111', 'Word2Vec_112', 'Word2Vec_113', 'Word2Vec_114', 'Word2Vec_115', 'Word2Vec_116', 'Word2Vec_117', 'Word2Vec_118', 'Word2Vec_119', 'Word2Vec_120', 'Word2Vec_121', 'Word2Vec_122', 'Word2Vec_123', 'Word2Vec_124', 'Word2Vec_125', 'Word2Vec_126', 'Word2Vec_127', 'Word2Vec_128', 'Word2Vec_129', 'Word2Vec_130', 'Word2Vec_131', 'Word2Vec_132', 'Word2Vec_133', 'Word2Vec_134', 'Word2Vec_135', 'Word2Vec_136', 'Word2Vec_137', 'Word2Vec_138', 'Word2Vec_139', 'Word2Vec_140', 'Word2Vec_141', 'Word2Vec_142', 'Word2Vec_143', 'Word2Vec_144', 'Word2Vec_145', 'Word2Vec_146', 'Word2Vec_147', 'Word2Vec_148', 'Word2Vec_149', 'Word2Vec_150', 'Word2Vec_151', 'Word2Vec_152', 'Word2Vec_153', 'Word2Vec_154', 'Word2Vec_155', 'Word2Vec_156', 'Word2Vec_157', 'Word2Vec_158', 'Word2Vec_159', 'Word2Vec_160', 'Word2Vec_161', 'Word2Vec_162', 'Word2Vec_163', 'Word2Vec_164', 'Word2Vec_165', 'Word2Vec_166', 'Word2Vec_167', 'Word2Vec_168', 'Word2Vec_169', 'Word2Vec_170', 'Word2Vec_171', 'Word2Vec_172', 'Word2Vec_173', 'Word2Vec_174', 'Word2Vec_175', 'Word2Vec_176', 'Word2Vec_177', 'Word2Vec_178', 'Word2Vec_179', 'Word2Vec_180', 'Word2Vec_181', 'Word2Vec_182', 'Word2Vec_183', 'Word2Vec_184', 'Word2Vec_185', 'Word2Vec_186', 'Word2Vec_187', 'Word2Vec_188', 'Word2Vec_189', 'Word2Vec_190', 'Word2Vec_191', 'Word2Vec_192', 'Word2Vec_193', 'Word2Vec_194', 'Word2Vec_195', 'Word2Vec_196', 'Word2Vec_197', 'Word2Vec_198', 'Word2Vec_199', 'Word2Vec_200', 'Word2Vec_201', 'Word2Vec_202', 'Word2Vec_203', 'Word2Vec_204', 'Word2Vec_205', 'Word2Vec_206', 'Word2Vec_207', 'Word2Vec_208', 'Word2Vec_209', 'Word2Vec_210', 'Word2Vec_211', 'Word2Vec_212', 'Word2Vec_213', 'Word2Vec_214', 'Word2Vec_215', 'Word2Vec_216', 'Word2Vec_217', 'Word2Vec_218', 'Word2Vec_219', 'Word2Vec_220', 'Word2Vec_221', 'Word2Vec_222', 'Word2Vec_223', 'Word2Vec_224', 'Word2Vec_225', 'Word2Vec_226', 'Word2Vec_227', 'Word2Vec_228', 'Word2Vec_229', 'Word2Vec_230', 'Word2Vec_231', 'Word2Vec_232', 'Word2Vec_233', 'Word2Vec_234', 'Word2Vec_235', 'Word2Vec_236', 'Word2Vec_237', 'Word2Vec_238', 'Word2Vec_239', 'Word2Vec_240', 'Word2Vec_241', 'Word2Vec_242', 'Word2Vec_243', 'Word2Vec_244', 'Word2Vec_245', 'Word2Vec_246', 'Word2Vec_247', 'Word2Vec_248', 'Word2Vec_249', 'Word2Vec_250', 'Word2Vec_251', 'Word2Vec_252', 'Word2Vec_253', 'Word2Vec_254', 'Word2Vec_255', 'Word2Vec_256', 'Word2Vec_257', 'Word2Vec_258', 'Word2Vec_259', 'Word2Vec_260', 'Word2Vec_261', 'Word2Vec_262', 'Word2Vec_263', 'Word2Vec_264', 'Word2Vec_265', 'Word2Vec_266', 'Word2Vec_267', 'Word2Vec_268', 'Word2Vec_269', 'Word2Vec_270', 'Word2Vec_271', 'Word2Vec_272', 'Word2Vec_273', 'Word2Vec_274', 'Word2Vec_275', 'Word2Vec_276', 'Word2Vec_277', 'Word2Vec_278', 'Word2Vec_279', 'Word2Vec_280', 'Word2Vec_281', 'Word2Vec_282', 'Word2Vec_283', 'Word2Vec_284', 'Word2Vec_285', 'Word2Vec_286', 'Word2Vec_287', 'Word2Vec_288', 'Word2Vec_289', 'Word2Vec_290', 'Word2Vec_291', 'Word2Vec_292', 'Word2Vec_293', 'Word2Vec_294', 'Word2Vec_295', 'Word2Vec_296', 'Word2Vec_297', 'Word2Vec_298', 'Word2Vec_299']

Label encode Accident level and Potential Accident Level in all the 3 dataframes

In [ ]:
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Encode 'Accident Level' and 'Potential Accident Level' in Glove_df
Glove_df['Accident Level'] = label_encoder.fit_transform(Glove_df['Accident Level'])
Glove_df['Potential Accident Level'] = label_encoder.fit_transform(Glove_df['Potential Accident Level'])

# Encode 'Accident Level' and 'Potential Accident Level' in TFIDF_df
TFIDF_df['Accident Level'] = label_encoder.fit_transform(TFIDF_df['Accident Level'])
TFIDF_df['Potential Accident Level'] = label_encoder.fit_transform(TFIDF_df['Potential Accident Level'])

# Encode 'Accident Level' and 'Potential Accident Level' in Word2Vec_df
Word2Vec_df['Accident Level'] = label_encoder.fit_transform(Word2Vec_df['Accident Level'])
Word2Vec_df['Potential Accident Level'] = label_encoder.fit_transform(Word2Vec_df['Potential Accident Level'])
In [ ]:
# Export to Intermediate Excel File to Drive, later to build the Model2 using "Potential_Accident_level"
Glove_df.to_excel('/content/drive/MyDrive/AIML_Capstone_Project/Intermediate_NLP_Glove_df.xlsx', index=False)
TFIDF_df.to_excel('/content/drive/MyDrive/AIML_Capstone_Project/Intermediate_NLP_TFIDF_df.xlsx', index=False)
Word2Vec_df.to_excel('/content/drive/MyDrive/AIML_Capstone_Project/Intermediate_NLP_Word2Vec_df.xlsx', index=False)
In [ ]:
# Columns to drop
columns_to_drop = ['Day', 'Potential Accident Level', 'Description']

# Drop columns from each DataFrame
Glove_df = Glove_df.drop(columns_to_drop, axis=1)
TFIDF_df = TFIDF_df.drop(columns_to_drop, axis=1)
Word2Vec_df = Word2Vec_df.drop(columns_to_drop, axis=1)
In [ ]:
# Calculate target variable distribution for each DataFrame
glove_target_dist = Glove_df['Accident Level'].value_counts(normalize=False)
tfidf_target_dist = TFIDF_df['Accident Level'].value_counts(normalize=False)
word2vec_target_dist = Word2Vec_df['Accident Level'].value_counts(normalize=False)

# Create a DataFrame to display the distributions
target_distribution_df = pd.DataFrame({
    'Glove': glove_target_dist,
    'TF-IDF': tfidf_target_dist,
    'Word2Vec': word2vec_target_dist
})

# Print the DataFrame
target_distribution_df
Out[ ]:
Glove TF-IDF Word2Vec
Accident Level
0 309 309 309
1 40 40 40
2 31 31 31
3 30 30 30
4 8 8 8

Observations: Target Variable Distribution:

Across all three embedding methods (GloVe, TF-IDF, Word2Vec), the distribution of the target variable "Accident Level" remains consistent. This indicates that the embedding process itself doesn't significantly alter the representation of the target variable. The majority of instances fall under a specific "Accident Level" (likely the most common type of accident), highlighting the imbalanced nature of the dataset. Implications for Modeling:

The imbalanced target distribution suggests the need for addressing class imbalance during model training. Techniques like oversampling, undersampling, or using weighted loss functions might be necessary to improve model performance on minority classes. Careful evaluation metrics (precision, recall, F1-score) should be used to assess model performance on all classes, not just the majority class.

In [ ]:
!pip install imblearn
Collecting imblearn
  Downloading imblearn-0.0-py2.py3-none-any.whl.metadata (355 bytes)
Requirement already satisfied: imbalanced-learn in /usr/local/lib/python3.10/dist-packages (from imblearn) (0.12.4)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.26.4)
Requirement already satisfied: scipy>=1.5.0 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.13.1)
Requirement already satisfied: scikit-learn>=1.0.2 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.5.2)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.4.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (3.5.0)
Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Installing collected packages: imblearn
Successfully installed imblearn-0.0
In [ ]:
# Balance 'Accident Level' using SMOTE. for all the 3 dataframes.
# Converting categorical features to numerical using one-hot encoding

import pandas as pd
from imblearn.over_sampling import SMOTE

# Function to balance data and one-hot encode categorical features
def balance_and_encode(df):
  # Separate features and target variable
  X = df.drop('Accident Level', axis=1)
  y = df['Accident Level']

  # One-hot encode categorical features (if any)
  categorical_features = X.select_dtypes(include=['object']).columns
  if categorical_features.any():
    X_encoded = pd.get_dummies(X, columns=categorical_features, dtype=int, drop_first=True)
  else:
    X_encoded = X

  # One-hot encode 'DayOfWeek'
  #X_encoded = pd.get_dummies(X_encoded, columns=['DayOfWeek'], dtype=int, drop_first=True)

  # Apply SMOTE to balance the dataset
  smote = SMOTE(random_state=42)
  X_resampled, y_resampled = smote.fit_resample(X_encoded, y)

  # Combine balanced features and target
  balanced_df = pd.concat([X_resampled, y_resampled], axis=1)

  return balanced_df

# Apply the function to each DataFrame
Glove_df_Bal = balance_and_encode(Glove_df)
TFIDF_df_Bal = balance_and_encode(TFIDF_df)
Word2Vec_df_Bal = balance_and_encode(Word2Vec_df)

# Calculate balanced target variable distribution for each DataFrame
glove_balanced_dist = Glove_df_Bal['Accident Level'].value_counts(normalize=False)
tfidf_balanced_dist = TFIDF_df_Bal['Accident Level'].value_counts(normalize=False)
word2vec_balanced_dist = Word2Vec_df_Bal['Accident Level'].value_counts(normalize=False)

# Create a DataFrame to display the balanced distributions
Balanced_Distribution_df = pd.DataFrame({
    'Glove (Balanced)': glove_balanced_dist,
    'TF-IDF (Balanced)': tfidf_balanced_dist,
    'Word2Vec (Balanced)': word2vec_balanced_dist
})

# Print the DataFrame
Balanced_Distribution_df
Out[ ]:
Glove (Balanced) TF-IDF (Balanced) Word2Vec (Balanced)
Accident Level
0 309 309 309
3 309 309 309
2 309 309 309
1 309 309 309
4 309 309 309
In [ ]:
Glove_df_Bal
Out[ ]:
WeekofYear Weekend GloVe_0 GloVe_1 GloVe_2 GloVe_3 GloVe_4 GloVe_5 GloVe_6 GloVe_7 ... Weekday_Monday Weekday_Saturday Weekday_Sunday Weekday_Thursday Weekday_Tuesday Weekday_Wednesday Season_Spring Season_Summer Season_Winter Accident Level
0 53 0 0.078223 0.040773 -0.041107 -0.293287 -0.148195 -0.085006 0.120392 -0.043692 ... 0 0 0 0 0 0 0 1 0 0
1 53 1 -0.047137 0.109611 -0.049147 -0.199018 0.049427 -0.139335 0.039627 -0.095639 ... 0 1 0 0 0 0 0 1 0 0
2 1 0 -0.057290 0.202640 -0.209550 -0.169683 -0.027187 -0.091942 -0.168629 -0.005628 ... 0 0 0 0 0 1 0 1 0 0
3 1 0 -0.033755 0.019709 -0.029097 -0.216930 -0.088179 -0.137728 -0.017687 0.012178 ... 0 0 0 0 0 0 0 1 0 0
4 1 1 -0.099598 0.082313 -0.132139 -0.090341 -0.122124 -0.055800 0.132037 0.086205 ... 0 0 1 0 0 0 0 1 0 3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1540 7 0 -0.032386 0.150688 -0.072310 -0.199612 -0.108686 -0.049934 0.060058 0.046013 ... 0 0 0 0 0 1 0 0 0 4
1541 16 0 -0.001804 0.034911 -0.063450 -0.121943 -0.084910 -0.065226 0.098614 -0.000395 ... 0 0 0 0 0 1 0 0 0 4
1542 9 0 -0.053629 -0.038371 -0.001241 -0.164928 -0.026603 -0.025482 0.008777 -0.027883 ... 0 0 0 0 0 1 0 0 0 4
1543 6 0 -0.049208 0.173114 -0.019693 -0.221013 -0.122697 0.026380 0.081478 0.041888 ... 0 0 0 0 0 0 0 0 0 4
1544 11 0 -0.030766 0.046516 -0.048639 -0.174432 -0.111411 0.025456 0.061749 0.028913 ... 0 0 0 0 0 0 0 0 0 4

1545 rows × 362 columns

In [ ]:
TFIDF_df_Bal
Out[ ]:
WeekofYear Weekend abb abdoman able abratech abrupt abruptly absorb absorbent ... Weekday_Monday Weekday_Saturday Weekday_Sunday Weekday_Thursday Weekday_Tuesday Weekday_Wednesday Season_Spring Season_Summer Season_Winter Accident Level
0 53 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 1 0 0
1 53 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 1 0 0 0 0 0 1 0 0
2 1 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 1 0 1 0 0
3 1 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 1 0 0
4 1 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 1 0 0 0 0 1 0 3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1540 7 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 1 0 0 0 4
1541 16 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 1 0 0 0 4
1542 9 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 1 0 0 0 4
1543 6 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 4
1544 11 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 4

1545 rows × 2420 columns

In [ ]:
Word2Vec_df_Bal
Out[ ]:
WeekofYear Weekend Word2Vec_0 Word2Vec_1 Word2Vec_2 Word2Vec_3 Word2Vec_4 Word2Vec_5 Word2Vec_6 Word2Vec_7 ... Weekday_Monday Weekday_Saturday Weekday_Sunday Weekday_Thursday Weekday_Tuesday Weekday_Wednesday Season_Spring Season_Summer Season_Winter Accident Level
0 53 0 -0.004984 0.015383 -0.001283 0.009300 0.000854 -0.014528 0.009532 0.032306 ... 0 0 0 0 0 0 0 1 0 0
1 53 1 -0.001628 0.005270 -0.000679 0.003373 0.000252 -0.005108 0.004059 0.011565 ... 0 1 0 0 0 0 0 1 0 0
2 1 0 -0.004345 0.015023 -0.001336 0.009527 0.000142 -0.015107 0.010512 0.031316 ... 0 0 0 0 0 1 0 1 0 0
3 1 0 -0.004084 0.012927 -0.001340 0.008422 0.000501 -0.013057 0.009335 0.027893 ... 0 0 0 0 0 0 0 1 0 0
4 1 1 -0.003625 0.013272 -0.001259 0.008496 0.000195 -0.012889 0.008750 0.028351 ... 0 0 1 0 0 0 0 1 0 3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1540 7 0 -0.003619 0.012774 -0.001495 0.008221 0.000332 -0.012071 0.007991 0.026018 ... 0 0 0 0 0 1 0 0 0 4
1541 16 0 -0.003347 0.011251 -0.001295 0.007523 0.000144 -0.010854 0.007278 0.023423 ... 0 0 0 0 0 1 0 0 0 4
1542 9 0 -0.003090 0.008991 -0.000836 0.006072 0.000814 -0.008340 0.005977 0.018974 ... 0 0 0 0 0 1 0 0 0 4
1543 6 0 -0.004212 0.015104 -0.001325 0.009446 0.000513 -0.014402 0.009215 0.030640 ... 0 0 0 0 0 0 0 0 0 4
1544 11 0 -0.003326 0.011584 -0.001079 0.007886 0.000230 -0.011256 0.007336 0.023937 ... 0 0 0 0 0 0 0 0 0 4

1545 rows × 362 columns

In [ ]:
#Check for Missing values and duplicates in all the 3 dataframes

# Function to check for missing values and duplicates
def check_data_quality(df, df_name):
  missing_values = df.isnull().sum()
  duplicates = df.duplicated().sum()
  return pd.DataFrame({
      'DataFrame': [df_name],
      'Missing Values': [missing_values.sum()],
      'Duplicates': [duplicates]
  })

# Check data quality for each DataFrame
glove_quality = check_data_quality(Glove_df_Bal, 'Glove_df_Bal')
tfidf_quality = check_data_quality(TFIDF_df_Bal, 'TFIDF_df_Bal')
word2vec_quality = check_data_quality(Word2Vec_df_Bal, 'Word2Vec_df_Bal')

# Concatenate results into a single DataFrame
data_quality_summary = pd.concat([glove_quality, tfidf_quality, word2vec_quality], ignore_index=True)

# Display the summary
data_quality_summary
Out[ ]:
DataFrame Missing Values Duplicates
0 Glove_df_Bal 0 0
1 TFIDF_df_Bal 0 0
2 Word2Vec_df_Bal 0 0

Step 4 - Data preparation - Cleansed data in .xlsx or .csv file¶

In [ ]:
#Rename the final dataframes as Final_NLP_Glove_df, Final_NLP_TFIDF_df & Final_NLP_Word2Vec

Final_NLP_Glove_df = Glove_df_Bal.copy()
Final_NLP_TFIDF_df = TFIDF_df_Bal.copy()
Final_NLP_Word2Vec_df = Word2Vec_df_Bal.copy()
In [ ]:
!pip install openpyxl
Requirement already satisfied: openpyxl in /usr/local/lib/python3.10/dist-packages (3.1.5)
Requirement already satisfied: et-xmlfile in /usr/local/lib/python3.10/dist-packages (from openpyxl) (2.0.0)
In [ ]:
# Export the 3 dataframes in csv and xlsx

# Export to CSV
Final_NLP_Glove_df.to_csv('/content/drive/MyDrive/AIML_Capstone_Project/Final_NLP_Glove_df.csv', index=False)
Final_NLP_TFIDF_df.to_csv('/content/drive/MyDrive/AIML_Capstone_Project/Final_NLP_TFIDF_df.csv', index=False)
Final_NLP_Word2Vec_df.to_csv('/content/drive/MyDrive/AIML_Capstone_Project/Final_NLP_Word2Vec_df.csv', index=False)


# Export to Excel
Final_NLP_Glove_df.to_excel('/content/drive/MyDrive/AIML_Capstone_Project/Final_NLP_Glove_df.xlsx', index=False)
Final_NLP_TFIDF_df.to_excel('/content/drive/MyDrive/AIML_Capstone_Project/Final_NLP_TFIDF_df.xlsx', index=False)
Final_NLP_Word2Vec_df.to_excel('/content/drive/MyDrive/AIML_Capstone_Project/Final_NLP_Word2Vec_df.xlsx', index=False)

Step 5 - Design train and test Basic Machine Learning classifiers¶

Base ML Classifiers¶

In [ ]:
# Initialise all the known classifiers and  to run model on the 3 dataframes

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import time

# Initialize classifiers
classifiers = {
    "Logistic Regression": LogisticRegression(),
    "Support Vector Machine": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "XG Boost": XGBClassifier(),
    "Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier()
}

# Function to train and evaluate models
def train_and_evaluate(df):
    X = df.drop('Accident Level', axis=1)
    y = df['Accident Level']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    results = []
    for name, clf in classifiers.items():
        start_time = time.time()
        clf.fit(X_train, y_train)
        training_time = time.time() - start_time

        # Train metrics
        y_train_pred = clf.predict(X_train)
        train_accuracy = accuracy_score(y_train, y_train_pred)
        train_precision = precision_score(y_train, y_train_pred, average='weighted')
        train_recall = recall_score(y_train, y_train_pred, average='weighted')
        train_f1 = f1_score(y_train, y_train_pred, average='weighted')

        start_time = time.time()
        y_test_pred = clf.predict(X_test)
        prediction_time = time.time() - start_time

        # Test metrics
        test_accuracy = accuracy_score(y_test, y_test_pred)
        test_precision = precision_score(y_test, y_test_pred, average='weighted')
        test_recall = recall_score(y_test, y_test_pred, average='weighted')
        test_f1 = f1_score(y_test, y_test_pred, average='weighted')

        results.append([name,
                        train_accuracy, train_precision, train_recall, train_f1,
                        test_accuracy, test_precision, test_recall, test_f1,
                        training_time, prediction_time])

    return results

# Train and evaluate on each DataFrame
glove_results = train_and_evaluate(Final_NLP_Glove_df)
tfidf_results = train_and_evaluate(Final_NLP_TFIDF_df)
word2vec_results = train_and_evaluate(Final_NLP_Word2Vec_df)

# Create DataFrames for results
columns = ['Classifier',
           'Train Accuracy', 'Train Precision', 'Train Recall', 'Train F1-score',
           'Test Accuracy', 'Test Precision', 'Test Recall', 'Test F1-score',
           'Training Time', 'Prediction Time']

glove_df = pd.DataFrame(glove_results, columns=columns)
tfidf_df = pd.DataFrame(tfidf_results, columns=columns)
word2vec_df = pd.DataFrame(word2vec_results, columns=columns)
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
In [ ]:
Final_NLP_Glove_df.head()
Out[ ]:
WeekofYear Weekend GloVe_0 GloVe_1 GloVe_2 GloVe_3 GloVe_4 GloVe_5 GloVe_6 GloVe_7 ... Weekday_Monday Weekday_Saturday Weekday_Sunday Weekday_Thursday Weekday_Tuesday Weekday_Wednesday Season_Spring Season_Summer Season_Winter Accident Level
0 53 0 0.078223 0.040773 -0.041107 -0.293287 -0.148195 -0.085006 0.120392 -0.043692 ... 0 0 0 0 0 0 0 1 0 0
1 53 1 -0.047137 0.109611 -0.049147 -0.199018 0.049427 -0.139335 0.039627 -0.095639 ... 0 1 0 0 0 0 0 1 0 0
2 1 0 -0.057290 0.202640 -0.209550 -0.169683 -0.027187 -0.091942 -0.168629 -0.005628 ... 0 0 0 0 0 1 0 1 0 0
3 1 0 -0.033755 0.019709 -0.029097 -0.216930 -0.088179 -0.137728 -0.017687 0.012178 ... 0 0 0 0 0 0 0 1 0 0
4 1 1 -0.099598 0.082313 -0.132139 -0.090341 -0.122124 -0.055800 0.132037 0.086205 ... 0 0 1 0 0 0 0 1 0 3

5 rows × 362 columns

In [ ]:
print("Classification matrix for Glove")
glove_df
Classification matrix for Glove
Out[ ]:
Classifier Train Accuracy Train Precision Train Recall Train F1-score Test Accuracy Test Precision Test Recall Test F1-score Training Time Prediction Time
0 Logistic Regression 0.915049 0.915494 0.915049 0.914690 0.844660 0.852931 0.844660 0.843098 0.552824 0.024117
1 Support Vector Machine 0.360841 0.333494 0.360841 0.303584 0.288026 0.206940 0.288026 0.221212 0.548604 0.205734
2 Decision Tree 0.999191 0.999194 0.999191 0.999191 0.812298 0.810740 0.812298 0.811174 0.478878 0.006691
3 Random Forest 0.999191 0.999194 0.999191 0.999191 0.983819 0.983944 0.983819 0.983744 2.616991 0.016839
4 Gradient Boosting 0.999191 0.999194 0.999191 0.999191 0.983819 0.983837 0.983819 0.983733 91.923566 0.007093
5 XG Boost 0.999191 0.999194 0.999191 0.999191 0.987055 0.987142 0.987055 0.987044 7.352176 0.121171
6 Naive Bayes 0.684466 0.726348 0.684466 0.669862 0.679612 0.703977 0.679612 0.669049 0.011746 0.007185
7 K-Nearest Neighbors 0.836570 0.863304 0.836570 0.810828 0.822006 0.850718 0.822006 0.786323 0.006209 0.038515
In [ ]:
print("Classification matrix for TFIDF")
tfidf_df
Classification matrix for TFIDF
Out[ ]:
Classifier Train Accuracy Train Precision Train Recall Train F1-score Test Accuracy Test Precision Test Recall Test F1-score Training Time Prediction Time
0 Logistic Regression 0.932039 0.933777 0.932039 0.932083 0.860841 0.878452 0.860841 0.862636 3.118558 0.036941
1 Support Vector Machine 0.348706 0.343183 0.348706 0.288026 0.275081 0.186068 0.275081 0.202436 2.205038 0.724229
2 Decision Tree 0.999191 0.999194 0.999191 0.999191 0.896440 0.900927 0.896440 0.898092 0.152036 0.013515
3 Random Forest 0.999191 0.999194 0.999191 0.999191 0.977346 0.978990 0.977346 0.977527 0.521030 0.021319
4 Gradient Boosting 0.999191 0.999194 0.999191 0.999191 0.928803 0.944271 0.928803 0.932295 26.386561 0.018245
5 XG Boost 0.999191 0.999194 0.999191 0.999191 0.944984 0.951826 0.944984 0.946714 8.598177 0.444374
6 Naive Bayes 0.999191 0.999194 0.999191 0.999191 0.957929 0.965736 0.957929 0.959332 0.071479 0.027590
7 K-Nearest Neighbors 0.827670 0.850988 0.827670 0.803432 0.773463 0.803507 0.773463 0.739317 0.044303 0.076110
In [ ]:
print("Classification matrix for Wor2Vec")
word2vec_df
Classification matrix for Wor2Vec
Out[ ]:
Classifier Train Accuracy Train Precision Train Recall Train F1-score Test Accuracy Test Precision Test Recall Test F1-score Training Time Prediction Time
0 Logistic Regression 0.634304 0.634396 0.634304 0.619120 0.553398 0.554543 0.553398 0.524524 0.206677 0.004715
1 Support Vector Machine 0.333333 0.348829 0.333333 0.267023 0.275081 0.194967 0.275081 0.201590 0.358189 0.123147
2 Decision Tree 0.999191 0.999194 0.999191 0.999191 0.834951 0.840039 0.834951 0.835264 0.324770 0.002836
3 Random Forest 0.999191 0.999194 0.999191 0.999191 0.951456 0.951422 0.951456 0.950454 1.444206 0.009420
4 Gradient Boosting 0.999191 0.999194 0.999191 0.999191 0.961165 0.960956 0.961165 0.960852 72.207624 0.006956
5 XG Boost 0.999191 0.999194 0.999191 0.999191 0.980583 0.980481 0.980583 0.980458 6.313678 0.125022
6 Naive Bayes 0.493528 0.544607 0.493528 0.477154 0.466019 0.469953 0.466019 0.425129 0.011570 0.007129
7 K-Nearest Neighbors 0.790453 0.813698 0.790453 0.778016 0.718447 0.719462 0.718447 0.693340 0.005880 0.033446

GloVe Embedding:

Logistic Regression shows strong performance with a Train Accuracy of 0.96352 and Test Accuracy of 0.92803, indicating good generalization. Support Vector Machine (SVM) and Gradient Boosting exhibit high accuracy and precision, both in training and testing phases. Random Forest and XG Boost have perfect training metrics (1.0 for accuracy, precision, recall, and F1-score), but slightly lower test scores, suggesting potential overfitting. K-Nearest Neighborshas the lowest performance among the classifiers for GloVe, with a Test Accuracy of 0.86208. TFIDF Features:

Logistic Regression and SVM again perform well, with Logistic Regression achieving a Test Accuracy of 0.94820 and SVM achieving 0.92556. Random Forest, Gradient Boosting, and XG Boost continue to show perfect training scores but have a slight drop in test scores compared to their performance with GloVe. KNN shows improvement over its performance with GloVe, achieving a Test Accuracy of 0.84460. Word2Vec Embedding:

Logistic Regression has a lower performance compared to the other two embeddings, with a Test Accuracy of 0.64013. SVM and Gradient Boosting show better adaptability with Word2Vec, maintaining relatively high test accuracies of 0.69257 and 0.95731, respectively. Random Forest and XG Boost maintain high training scores but experience a drop in test accuracy, indicating a stronger tendency to overfit with this embedding. KNN shows the least performance drop among the classifiers when using Word2Vec, suggesting it handles the nuances of Word2Vec better than some more complex models. Insights:

Overfitting: Models like Random Forest and XG Boost tend to overfit with perfect training scores but lower test scores, especially noticeable with Word2Vec. General Performance: Logistic Regression and SVM generally offer robust performance across different embeddings, making them good baseline models for text classification tasks. Embedding Suitability: GloVe and TFIDF generally provide better results across most classifiers compared to Word2Vec, which might require more tuning or different model architectures to achieve comparable performance. Model Complexity vs Performance: Simpler models like Logistic Regression sometimes outperform more complex ones, especially in scenarios where overfitting is a risk (notably with Word2Vec).

In [ ]:
# Plotting the classification report for all the ML classifers with training and prediction time comparisions.

import time
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Function to plot classification report and training/prediction times
def plot_results(df, title):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

    # Classification report heatmap
    report_data = df[['Classifier', 'Train Precision', 'Train Recall', 'Train F1-score',
                       'Test Precision', 'Test Recall', 'Test F1-score']].set_index('Classifier')
    sns.heatmap(report_data, annot=True, cmap='Oranges', fmt='.2f', ax=ax1)
    ax1.set_title(f'Classifier Performance - {title}')

    # Training and prediction time comparison
    df.plot(x='Classifier', y=['Training Time', 'Prediction Time'], kind='bar', ax=ax2, cmap='Set3')
    ax2.set_title(f'Training and Prediction Time - {title}')
    ax2.set_ylabel('Time (seconds)')
    plt.tight_layout()
    plt.show()

# Plot results for each DataFrame
plot_results(glove_df, 'Glove Embeddings')
plot_results(tfidf_df, 'TF-IDF Embeddings')
plot_results(word2vec_df, 'Word2Vec Embeddings')
In [ ]:
# Function to plot confusion matrix against all classifiers with word embeddings generated using Glove, TF-IDF, Word2Vec:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

def plot_confusion_matrices(df, df_name):
  X = df.drop('Accident Level', axis=1)
  y = df['Accident Level']
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

  fig, axes = plt.subplots(2, 4, figsize=(20, 10))
  fig.suptitle(f'Confusion Matrices for {df_name}', fontsize=16)

  for i, (name, clf) in enumerate(classifiers.items()):
    row = i // 4
    col = i % 4
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
    disp.plot(ax=axes[row, col], cmap='Oranges')
    axes[row, col].set_title(name)

  plt.tight_layout()
  plt.show()
In [ ]:
plot_confusion_matrices(Final_NLP_Glove_df, 'Glove Embeddings')
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
In [ ]:
plot_confusion_matrices(Final_NLP_TFIDF_df, 'TF-IDF Features')
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
In [ ]:
plot_confusion_matrices(Final_NLP_Word2Vec_df, 'Word2Vec Embeddings')
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Confusion Matrix Observations: (Base Classifiers) Overall Performance:

Across all embeddings, Random Forest and XG Boost consistently perform well, showing high accuracy across most classes. Naive Bayes generally performs the poorest, especially with Glove and Word2Vec embeddings. Glove Embeddings:

Most classifiers perform well, with Random Forest, XG Boost, and Gradient Boosting showing particularly strong results. The Decision Tree has more misclassifications compared to other top-performing classifiers. K-Nearest Neighbors shows moderate performance but struggles more with class 0 compared to other classifiers. TF-IDF Features:

Overall, the performance seems slightly better than with Glove embeddings. Logistic Regression and Support Vector Machine show improved performance compared to their Glove counterpart. K-Nearest Neighbors still struggles with class 0 but performs better in other classes. Word2Vec Embeddings:

Performance is generally lower compared to Glove and TF-IDF, especially for simpler models. Random Forest, Gradient Boosting, and XG Boost maintain strong performance. Logistic Regression and Support Vector Machine show a notable decrease in accuracy, especially for classes 1, 2, and 3. Naive Bayes and K-Nearest Neighbors struggle significantly with this embedding. Class-specific observations:

Class 4 is consistently well-classified across all embeddings and most classifiers. Classes 0 and 1 often see more misclassifications, especially in Word2Vec embeddings. The middle classes (1, 2, 3) tend to have more confusion between them, particularly in Word2Vec. Model Complexity:

More complex models (Random Forest, XG Boost, Gradient Boosting) generally perform better across all embeddings. Simpler models like Logistic Regression and SVM are more sensitive to the choice of embedding. Embedding Effectiveness:

TF-IDF features seem to provide the most consistent performance across different classifiers. Glove embeddings perform well, especially with more complex models. Word2Vec embeddings appear less effective for this particular classification task, especially with simpler models. Conclusion:

The choice of both classifier and embedding has a significant impact on performance. For this particular task, ensemble methods like Random Forest and boosting algorithms seem most robust across different embeddings. TF-IDF features provide good overall performance, while Word2Vec embeddings might require more complex models to achieve comparable results. The effectiveness of different embeddings suggests that the nature of the text data and the specific classification task play a crucial role in determining the most suitable approach.

Train vs Test Confusion Matrices for all Base ML classifiers

In [ ]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

def plot_train_test_confusion_matrices(df, df_name):
    X = df.drop('Accident Level', axis=1)
    y = df['Accident Level']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    fig, axes = plt.subplots(8, 2, figsize=(20, 40))
    fig.suptitle(f'Train and Test Confusion Matrices for {df_name}', fontsize=15, y=0.98)

    for i, (name, clf) in enumerate(classifiers.items()):
        clf.fit(X_train, y_train)

        # Train confusion matrix
        y_train_pred = clf.predict(X_train)
        cm_train = confusion_matrix(y_train, y_train_pred)
        disp_train = ConfusionMatrixDisplay(confusion_matrix=cm_train, display_labels=clf.classes_)
        disp_train.plot(ax=axes[i, 0], cmap='Oranges')
        axes[i, 0].set_title(f'{name} (Train)', fontsize=12)

        # Test confusion matrix
        y_test_pred = clf.predict(X_test)
        cm_test = confusion_matrix(y_test, y_test_pred)
        disp_test = ConfusionMatrixDisplay(confusion_matrix=cm_test, display_labels=clf.classes_)
        disp_test.plot(ax=axes[i, 1], cmap='Oranges')
        axes[i, 1].set_title(f'{name} (Test)', fontsize=12)

    plt.tight_layout(rect=[0, 0, 1, 0.96])
    plt.show()
In [ ]:
plot_train_test_confusion_matrices(Final_NLP_Glove_df, 'Glove Embeddings')
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
In [ ]:
plot_train_test_confusion_matrices(Final_NLP_TFIDF_df, 'TF-IDF Features')
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
In [ ]:
plot_train_test_confusion_matrices(Final_NLP_Word2Vec_df, 'Word2Vec Embeddings')
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Base ML Classifiers + PCA¶

In [ ]:
# Apply PCA and scaling

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

def apply_pca_and_split(df, n_components=0.99):
  X = df.drop('Accident Level', axis=1)
  y = df['Accident Level']

  # Scaling
  scaler = StandardScaler()
  X_scaled = scaler.fit_transform(X)

  # PCA
  if n_components < 1:
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X_scaled)
  else:
    X_pca = X_scaled

  # Splitting
  X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)

  return X_train, X_test, y_train, y_test

# Apply to each dataframe
X_train_glove, X_test_glove, y_train_glove, y_test_glove = apply_pca_and_split(Final_NLP_Glove_df)
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = apply_pca_and_split(Final_NLP_TFIDF_df)
X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec = apply_pca_and_split(Final_NLP_Word2Vec_df)
In [ ]:
# Function to print explained variance rtio and cumulative explained variance for all 3 embeddings

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def print_pca_variance(df, df_name):
  X = df.drop('Accident Level', axis=1)

  # Scaling
  scaler = StandardScaler()
  X_scaled = scaler.fit_transform(X)

  # PCA
  pca = PCA()
  pca.fit(X_scaled)

  # Explained variance ratio and cumulative explained variance
  explained_variance_ratio = pca.explained_variance_ratio_
  cumulative_explained_variance = np.cumsum(explained_variance_ratio)

  print(f"----- PCA Variance for {df_name} -----")
  print("Explained Variance Ratio:", explained_variance_ratio)
  print("Cumulative Explained Variance:", cumulative_explained_variance)

# Print PCA variance for each dataframe
print_pca_variance(Final_NLP_Glove_df, 'Glove Embeddings')
print_pca_variance(Final_NLP_TFIDF_df, 'TF-IDF Features')
print_pca_variance(Final_NLP_Word2Vec_df, 'Word2Vec Embeddings')
----- PCA Variance for Glove Embeddings -----
Explained Variance Ratio: [6.76542314e-02 5.31719874e-02 4.49041016e-02 3.71903619e-02
 3.10909177e-02 2.62394110e-02 2.35574799e-02 2.22411828e-02
 2.10437158e-02 1.90270463e-02 1.83061059e-02 1.74786420e-02
 1.68464354e-02 1.55954557e-02 1.44331638e-02 1.38882056e-02
 1.31964137e-02 1.26448051e-02 1.16142128e-02 1.13543971e-02
 1.11496954e-02 1.05715160e-02 9.99784828e-03 9.71351633e-03
 9.35624585e-03 9.02329877e-03 8.88284908e-03 8.48204565e-03
 8.21985057e-03 8.05221237e-03 7.72182604e-03 7.64353630e-03
 7.45213396e-03 7.07586852e-03 6.93137316e-03 6.83995231e-03
 6.41797915e-03 6.26186469e-03 6.09918692e-03 5.98776476e-03
 5.77705309e-03 5.72247960e-03 5.62487190e-03 5.41175620e-03
 5.31808041e-03 5.21536350e-03 5.10911812e-03 4.93754251e-03
 4.74886556e-03 4.68699758e-03 4.65047871e-03 4.44673249e-03
 4.41021510e-03 4.40324895e-03 4.32867141e-03 4.18237981e-03
 4.09344196e-03 4.06609293e-03 3.96145006e-03 3.88948672e-03
 3.87034965e-03 3.75946284e-03 3.72579169e-03 3.69921225e-03
 3.66176525e-03 3.56886472e-03 3.45682563e-03 3.42981651e-03
 3.41751512e-03 3.36055737e-03 3.29621245e-03 3.27681081e-03
 3.21206563e-03 3.19899991e-03 3.16567557e-03 3.11811893e-03
 3.07726804e-03 3.06195209e-03 3.03499602e-03 2.99430569e-03
 2.95706966e-03 2.95151658e-03 2.90971771e-03 2.85506888e-03
 2.84987208e-03 2.79326155e-03 2.75821234e-03 2.73163567e-03
 2.68203537e-03 2.66344234e-03 2.65331315e-03 2.60807836e-03
 2.57851264e-03 2.56484692e-03 2.54000206e-03 2.47904586e-03
 2.46876403e-03 2.41213779e-03 2.39713151e-03 2.36209228e-03
 2.30702090e-03 2.30126851e-03 2.28101271e-03 2.24233358e-03
 2.20999841e-03 2.17590961e-03 2.13675411e-03 2.12843179e-03
 2.08805695e-03 2.07736359e-03 2.06336809e-03 2.00165898e-03
 1.96781628e-03 1.93991926e-03 1.91591664e-03 1.87885023e-03
 1.84387233e-03 1.83469248e-03 1.77923188e-03 1.75736324e-03
 1.73967196e-03 1.70158853e-03 1.68368928e-03 1.65246112e-03
 1.63902969e-03 1.60778966e-03 1.58492401e-03 1.53186527e-03
 1.50615727e-03 1.48704969e-03 1.44599979e-03 1.43415015e-03
 1.39452243e-03 1.37774789e-03 1.35937628e-03 1.33507798e-03
 1.31106618e-03 1.29387963e-03 1.24035165e-03 1.23449635e-03
 1.19866431e-03 1.18512195e-03 1.17794810e-03 1.17030996e-03
 1.12496327e-03 1.10755985e-03 1.09354400e-03 1.08101552e-03
 1.04243158e-03 1.03301820e-03 1.02580202e-03 1.00231696e-03
 9.86057002e-04 9.66210213e-04 9.44458368e-04 9.25552431e-04
 9.15029033e-04 8.94398980e-04 8.76651697e-04 8.66960329e-04
 8.48648610e-04 8.31969769e-04 8.16977938e-04 7.97692460e-04
 7.81453126e-04 7.62254701e-04 7.46289210e-04 7.38641105e-04
 7.28133202e-04 7.11236676e-04 7.10216978e-04 6.99222580e-04
 6.85794962e-04 6.77180389e-04 6.52449229e-04 6.48668113e-04
 6.21033580e-04 6.11424474e-04 6.05448101e-04 5.94336581e-04
 5.73745728e-04 5.71079350e-04 5.60233123e-04 5.50370458e-04
 5.36322204e-04 5.24746588e-04 5.10420120e-04 5.06825193e-04
 5.01775236e-04 4.98764566e-04 4.94262056e-04 4.84697840e-04
 4.58627472e-04 4.48336841e-04 4.43840432e-04 4.38111844e-04
 4.21921309e-04 4.19097208e-04 4.11709902e-04 3.99298099e-04
 3.94111087e-04 3.89684877e-04 3.86434636e-04 3.73486032e-04
 3.67982419e-04 3.65508460e-04 3.50428266e-04 3.45907043e-04
 3.37094513e-04 3.29966819e-04 3.23902903e-04 3.16138187e-04
 3.12608123e-04 3.00798360e-04 2.96276583e-04 2.93102352e-04
 2.86940159e-04 2.74991574e-04 2.72600687e-04 2.67068134e-04
 2.58102432e-04 2.54464382e-04 2.51941191e-04 2.46171827e-04
 2.44134842e-04 2.40867561e-04 2.34120222e-04 2.26058504e-04
 2.21039443e-04 2.18521034e-04 2.13162859e-04 2.08832849e-04
 2.06795032e-04 2.00288637e-04 1.97424703e-04 1.92199676e-04
 1.85737359e-04 1.80857557e-04 1.78001967e-04 1.73853561e-04
 1.67916567e-04 1.65448654e-04 1.61997685e-04 1.55855253e-04
 1.51251667e-04 1.50188991e-04 1.46240432e-04 1.45699250e-04
 1.33995317e-04 1.32156903e-04 1.30578597e-04 1.24386531e-04
 1.22908856e-04 1.22129976e-04 1.18994495e-04 1.13981799e-04
 1.12115015e-04 1.10242889e-04 1.04812997e-04 9.99763199e-05
 9.88770198e-05 9.37247939e-05 9.21120943e-05 8.93968885e-05
 8.58635052e-05 8.50427953e-05 8.22417733e-05 8.09807570e-05
 7.76471528e-05 7.41906356e-05 7.05990120e-05 6.71759546e-05
 6.57723259e-05 6.33438551e-05 6.00648760e-05 5.78049369e-05
 5.70866733e-05 5.55482174e-05 5.38885302e-05 4.96411179e-05
 4.81464489e-05 4.60501584e-05 4.21281087e-05 4.18220247e-05
 3.99309439e-05 3.95652881e-05 3.68412952e-05 3.56314701e-05
 3.48379434e-05 3.17226086e-05 2.86403093e-05 2.78856992e-05
 2.34972691e-05 2.24267249e-05 2.18077368e-05 2.11195722e-05
 1.89459525e-05 1.71328476e-05 1.57786989e-05 1.42061544e-05
 1.39663846e-05 1.27280348e-05 1.23504199e-05 1.20024361e-05
 1.14062924e-05 1.07013327e-05 1.02278033e-05 9.39821393e-06
 9.15493058e-06 8.68333256e-06 8.30353951e-06 7.49463593e-06
 7.38158855e-06 6.85699340e-06 6.50062371e-06 6.28619534e-06
 6.22468569e-06 6.05732095e-06 5.98997689e-06 5.55231972e-06
 5.27853721e-06 5.06444861e-06 4.72840595e-06 4.41205652e-06
 4.32137207e-06 4.04874501e-06 3.86802001e-06 3.57726236e-06
 3.42496654e-06 3.20855971e-06 3.03179876e-06 2.86209012e-06
 2.72088942e-06 2.53566459e-06 2.43713104e-06 2.28991893e-06
 2.21430652e-06 2.14551791e-06 2.04781714e-06 1.88206431e-06
 1.75830437e-06 1.69480446e-06 1.62387079e-06 1.54787965e-06
 1.47919586e-06 1.39340471e-06 1.29890957e-06 1.19189351e-06
 1.13907412e-06 1.03854094e-06 9.72336836e-07 8.92093264e-07
 8.45184530e-07 7.58444433e-07 7.12059306e-07 6.47382194e-07
 5.93412369e-07 5.14997676e-07 4.34269948e-07 1.33424446e-32
 3.32835103e-34]
Cumulative Explained Variance: [0.06765423 0.12082622 0.16573032 0.20292068 0.2340116  0.26025101
 0.28380849 0.30604967 0.32709339 0.34612044 0.36442654 0.38190518
 0.39875162 0.41434707 0.42878024 0.44266844 0.45586486 0.46850966
 0.48012388 0.49147827 0.50262797 0.51319948 0.52319733 0.53291085
 0.54226709 0.55129039 0.56017324 0.56865529 0.57687514 0.58492735
 0.59264918 0.60029271 0.60774485 0.61482072 0.62175209 0.62859204
 0.63501002 0.64127189 0.64737107 0.65335884 0.65913589 0.66485837
 0.67048324 0.675895   0.68121308 0.68642844 0.69153756 0.6964751
 0.70122397 0.70591097 0.71056144 0.71500818 0.71941839 0.72382164
 0.72815031 0.73233269 0.73642613 0.74049223 0.74445368 0.74834316
 0.75221351 0.75597298 0.75969877 0.76339798 0.76705975 0.77062861
 0.77408544 0.77751525 0.78093277 0.78429332 0.78758954 0.79086635
 0.79407841 0.79727741 0.80044309 0.80356121 0.80663848 0.80970043
 0.81273542 0.81572973 0.8186868  0.82163832 0.82454803 0.8274031
 0.83025297 0.83304624 0.83580445 0.83853608 0.84121812 0.84388156
 0.84653488 0.84914295 0.85172147 0.85428631 0.85682632 0.85930536
 0.86177412 0.86418626 0.86658339 0.86894549 0.87125251 0.87355378
 0.87583479 0.87807712 0.88028712 0.88246303 0.88459978 0.88672822
 0.88881627 0.89089364 0.892957   0.89495866 0.89692648 0.8988664
 0.90078232 0.90266117 0.90450504 0.90633973 0.90811896 0.90987633
 0.911616   0.91331759 0.91500128 0.91665374 0.91829277 0.91990056
 0.92148548 0.92301735 0.9245235  0.92601055 0.92745655 0.9288907
 0.93028523 0.93166297 0.93302235 0.93435743 0.93566849 0.93696237
 0.93820272 0.93943722 0.94063589 0.94182101 0.94299896 0.94416927
 0.94529423 0.94640179 0.94749533 0.94857635 0.94961878 0.9506518
 0.9516776  0.95267992 0.95366597 0.95463218 0.95557664 0.95650219
 0.95741722 0.95831162 0.95918827 0.96005523 0.96090388 0.96173585
 0.96255283 0.96335052 0.96413198 0.96489423 0.96564052 0.96637916
 0.96710729 0.96781853 0.96852875 0.96922797 0.96991377 0.97059095
 0.9712434  0.97189206 0.9725131  0.97312452 0.97372997 0.97432431
 0.97489805 0.97546913 0.97602936 0.97657974 0.97711606 0.9776408
 0.97815122 0.97865805 0.97915982 0.97965859 0.98015285 0.98063755
 0.98109618 0.98154451 0.98198835 0.98242647 0.98284839 0.98326748
 0.98367919 0.98407849 0.9844726  0.98486229 0.98524872 0.98562221
 0.98599019 0.9863557  0.98670613 0.98705203 0.98738913 0.9877191
 0.988043   0.98835914 0.98867175 0.98897254 0.98926882 0.98956192
 0.98984886 0.99012385 0.99039646 0.99066352 0.99092163 0.99117609
 0.99142803 0.9916742  0.99191834 0.99215921 0.99239333 0.99261938
 0.99284042 0.99305894 0.99327211 0.99348094 0.99368774 0.99388802
 0.99408545 0.99427765 0.99446339 0.99464424 0.99482225 0.9949961
 0.99516402 0.99532946 0.99549146 0.99564732 0.99579857 0.99594876
 0.996095   0.9962407  0.99637469 0.99650685 0.99663743 0.99676181
 0.99688472 0.99700685 0.99712585 0.99723983 0.99735194 0.99746219
 0.997567   0.99766698 0.99776585 0.99785958 0.99795169 0.99804109
 0.99812695 0.99821199 0.99829424 0.99837522 0.99845286 0.99852705
 0.99859765 0.99866483 0.9987306  0.99879395 0.99885401 0.99891182
 0.9989689  0.99902445 0.99907834 0.99912798 0.99917613 0.99922218
 0.9992643  0.99930613 0.99934606 0.99938562 0.99942246 0.9994581
 0.99949293 0.99952466 0.9995533  0.99958118 0.99960468 0.99962711
 0.99964891 0.99967003 0.99968898 0.99970611 0.99972189 0.9997361
 0.99975006 0.99976279 0.99977514 0.99978714 0.99979855 0.99980925
 0.99981948 0.99982888 0.99983803 0.99984672 0.99985502 0.99986251
 0.9998699  0.99987675 0.99988325 0.99988954 0.99989576 0.99990182
 0.99990781 0.99991336 0.99991864 0.99992371 0.99992844 0.99993285
 0.99993717 0.99994122 0.99994509 0.99994866 0.99995209 0.9999553
 0.99995833 0.99996119 0.99996391 0.99996645 0.99996888 0.99997117
 0.99997339 0.99997553 0.99997758 0.99997946 0.99998122 0.99998292
 0.99998454 0.99998609 0.99998757 0.99998896 0.99999026 0.99999145
 0.99999259 0.99999363 0.9999946  0.99999549 0.99999634 0.9999971
 0.99999781 0.99999846 0.99999905 0.99999957 1.         1.
 1.        ]
----- PCA Variance for TF-IDF Features -----
Explained Variance Ratio: [1.15380708e-02 9.39338687e-03 9.29660710e-03 ... 5.98675766e-37
 5.09328237e-37 4.82115851e-37]
Cumulative Explained Variance: [0.01153807 0.02093146 0.03022806 ... 1.         1.         1.        ]
----- PCA Variance for Word2Vec Embeddings -----
Explained Variance Ratio: [6.15318118e-01 1.57793364e-02 1.16059084e-02 1.06317316e-02
 9.58211335e-03 8.80585342e-03 8.50330443e-03 8.23179286e-03
 7.19476285e-03 6.97805826e-03 6.68687930e-03 6.48743373e-03
 5.98163003e-03 5.66494034e-03 5.53585086e-03 5.14752708e-03
 5.02743292e-03 4.91029793e-03 4.79030749e-03 4.56106242e-03
 4.52152359e-03 4.36222330e-03 4.16888466e-03 4.09044288e-03
 4.02351791e-03 3.95276018e-03 3.86563019e-03 3.78868576e-03
 3.73133807e-03 3.62601886e-03 3.54885711e-03 3.52095093e-03
 3.49247139e-03 3.45501557e-03 3.34753743e-03 3.32736785e-03
 3.26827037e-03 3.21664153e-03 3.20260268e-03 3.15588437e-03
 3.10938991e-03 3.09256189e-03 3.03023641e-03 2.98839923e-03
 2.98238994e-03 2.95466294e-03 2.92181781e-03 2.90074120e-03
 2.87972144e-03 2.82713984e-03 2.82258593e-03 2.78330355e-03
 2.74520888e-03 2.69416561e-03 2.68132825e-03 2.64372377e-03
 2.61790681e-03 2.59742769e-03 2.54581981e-03 2.49727783e-03
 2.47743342e-03 2.44362429e-03 2.42568210e-03 2.38163232e-03
 2.32313029e-03 2.26568664e-03 2.23948743e-03 2.20841221e-03
 2.11981272e-03 2.08218081e-03 2.04595825e-03 2.00824696e-03
 1.98347243e-03 1.97652587e-03 1.90259892e-03 1.89002254e-03
 1.84068979e-03 1.78982509e-03 1.74410287e-03 1.73367558e-03
 1.71363953e-03 1.66239322e-03 1.61608003e-03 1.59019424e-03
 1.54276945e-03 1.51087462e-03 1.48882257e-03 1.45965117e-03
 1.43446295e-03 1.42452999e-03 1.34413433e-03 1.32041486e-03
 1.26828281e-03 1.23321016e-03 1.18953366e-03 1.17626388e-03
 1.11445800e-03 1.10112055e-03 1.08164859e-03 1.07701629e-03
 1.06361822e-03 1.02023757e-03 9.81507213e-04 9.40289028e-04
 9.24511013e-04 8.81335132e-04 8.76918642e-04 8.54803413e-04
 8.27539910e-04 8.18015635e-04 8.07350434e-04 7.85230904e-04
 7.73228407e-04 7.68978351e-04 7.48261928e-04 7.29310187e-04
 7.13178392e-04 6.96638013e-04 6.81351131e-04 6.65323262e-04
 6.53998507e-04 6.47320487e-04 6.23605793e-04 5.94971019e-04
 5.90407705e-04 5.79018652e-04 5.53795498e-04 5.45801031e-04
 5.24301110e-04 5.14311774e-04 5.01212371e-04 4.88874395e-04
 4.77774126e-04 4.56141297e-04 4.43852902e-04 4.39521637e-04
 4.28082287e-04 4.20151214e-04 4.12862208e-04 4.01566153e-04
 3.92287068e-04 3.83741913e-04 3.77319368e-04 3.60564085e-04
 3.58612282e-04 3.51757847e-04 3.42594252e-04 3.34685088e-04
 3.22619524e-04 3.09955249e-04 3.05061752e-04 3.02836784e-04
 2.88634722e-04 2.85421119e-04 2.80128167e-04 2.74872647e-04
 2.67108709e-04 2.59404450e-04 2.53211291e-04 2.45312115e-04
 2.43693933e-04 2.36227591e-04 2.28950766e-04 2.25586031e-04
 2.23459762e-04 2.15422229e-04 2.13663805e-04 2.01711704e-04
 2.00188423e-04 1.94985968e-04 1.94030547e-04 1.86985850e-04
 1.80973664e-04 1.77397525e-04 1.75520641e-04 1.70570056e-04
 1.67736188e-04 1.62673391e-04 1.59765838e-04 1.55647069e-04
 1.53080518e-04 1.51834393e-04 1.49809374e-04 1.43830007e-04
 1.37333945e-04 1.34500283e-04 1.31336939e-04 1.27077743e-04
 1.25822158e-04 1.22715485e-04 1.16262905e-04 1.14936232e-04
 1.11364534e-04 1.08225918e-04 1.07165480e-04 1.06321440e-04
 1.02295634e-04 1.00694901e-04 9.92275835e-05 9.86083188e-05
 9.63030934e-05 9.00874302e-05 8.93298697e-05 8.72221631e-05
 8.46196758e-05 8.29841368e-05 8.00917269e-05 7.90939669e-05
 7.87086595e-05 7.75845175e-05 7.54056655e-05 7.46111863e-05
 7.22732707e-05 7.13521080e-05 7.00574373e-05 6.81977286e-05
 6.58308944e-05 6.41904897e-05 6.25529545e-05 6.09028759e-05
 5.94693971e-05 5.77502684e-05 5.63193791e-05 5.60510835e-05
 5.52971255e-05 5.32522928e-05 5.17228706e-05 5.12330950e-05
 5.02009566e-05 4.90288092e-05 4.81851549e-05 4.58856821e-05
 4.54905023e-05 4.49380330e-05 4.34346450e-05 4.19229635e-05
 4.06229734e-05 3.99724164e-05 3.96792042e-05 3.85937495e-05
 3.77112924e-05 3.72130278e-05 3.63058936e-05 3.47927854e-05
 3.44867496e-05 3.29690042e-05 3.26142454e-05 3.19407681e-05
 3.11614183e-05 3.01621357e-05 2.93990128e-05 2.89413961e-05
 2.76026918e-05 2.72061300e-05 2.66438289e-05 2.58114033e-05
 2.51586600e-05 2.47258819e-05 2.37742904e-05 2.34032749e-05
 2.27214997e-05 2.14479986e-05 2.10154112e-05 2.07959971e-05
 2.02467762e-05 1.98288725e-05 1.90692745e-05 1.84354986e-05
 1.79522639e-05 1.76834070e-05 1.71021086e-05 1.69210694e-05
 1.61030731e-05 1.58175700e-05 1.56305326e-05 1.53107539e-05
 1.49200063e-05 1.42243843e-05 1.36137872e-05 1.35636450e-05
 1.31982839e-05 1.29722380e-05 1.27389209e-05 1.22784322e-05
 1.19720003e-05 1.18066702e-05 1.13322169e-05 1.11381339e-05
 1.08206306e-05 1.04303510e-05 9.86919266e-06 9.69126914e-06
 9.22883627e-06 9.21872861e-06 8.89823003e-06 8.57757745e-06
 8.54238904e-06 8.36843114e-06 8.04479709e-06 8.01249254e-06
 7.63187361e-06 7.58640124e-06 7.11146101e-06 6.89463775e-06
 6.54645375e-06 6.48188904e-06 6.39567246e-06 6.11168681e-06
 6.06293557e-06 5.79429816e-06 5.44987028e-06 5.36708925e-06
 5.26887925e-06 5.20769558e-06 4.99345882e-06 4.78227694e-06
 4.61501125e-06 4.38940740e-06 4.26795576e-06 4.21812259e-06
 3.97492428e-06 3.83323586e-06 3.73404976e-06 3.52193110e-06
 3.42617928e-06 3.30095319e-06 3.17315491e-06 2.94867240e-06
 2.88412510e-06 2.68101142e-06 2.60795280e-06 2.47024764e-06
 2.34443859e-06 2.19886685e-06 2.14135443e-06 2.02326942e-06
 1.98920534e-06 1.97258391e-06 1.85256726e-06 1.78034634e-06
 1.69294150e-06 1.62682649e-06 1.53650163e-06 1.45418515e-06
 1.39577913e-06 1.33996502e-06 1.31319799e-06 1.15914625e-06
 1.13549957e-06 1.05090117e-06 1.00355900e-06 8.97764593e-07
 8.49529379e-07 7.61648167e-07 7.42565784e-07 6.42846188e-07
 5.63070463e-07 5.03404562e-07 4.39658199e-07 4.00102041e-07
 1.03040009e-34]
Cumulative Explained Variance: [0.61531812 0.63109745 0.64270336 0.65333509 0.66291721 0.67172306
 0.68022637 0.68845816 0.69565292 0.70263098 0.70931786 0.71580529
 0.72178692 0.72745186 0.73298771 0.73813524 0.74316267 0.74807297
 0.75286328 0.75742434 0.76194587 0.76630809 0.77047697 0.77456742
 0.77859093 0.78254369 0.78640932 0.79019801 0.79392935 0.79755537
 0.80110422 0.80462518 0.80811765 0.81157266 0.8149202  0.81824757
 0.82151584 0.82473248 0.82793508 0.83109097 0.83420036 0.83729292
 0.84032315 0.84331155 0.84629394 0.84924861 0.85217042 0.85507117
 0.85795089 0.86077803 0.86360061 0.86638392 0.86912913 0.87182329
 0.87450462 0.87714834 0.87976625 0.88236368 0.8849095  0.88740678
 0.88988421 0.89232783 0.89475351 0.89713515 0.89945828 0.90172396
 0.90396345 0.90617186 0.90829168 0.91037386 0.91241982 0.91442806
 0.91641153 0.91838806 0.92029066 0.92218068 0.92402137 0.9258112
 0.9275553  0.92928898 0.93100262 0.93266501 0.93428109 0.93587128
 0.93741405 0.93892493 0.94041375 0.9418734  0.94330786 0.94473239
 0.94607653 0.94739694 0.94866523 0.94989844 0.95108797 0.95226423
 0.95337869 0.95447981 0.95556146 0.95663848 0.95770209 0.95872233
 0.95970384 0.96064413 0.96156864 0.96244997 0.96332689 0.9641817
 0.96500924 0.96582725 0.9666346  0.96741983 0.96819306 0.96896204
 0.9697103  0.97043961 0.97115279 0.97184943 0.97253078 0.9731961
 0.9738501  0.97449742 0.97512103 0.975716   0.97630641 0.97688543
 0.97743922 0.97798502 0.97850932 0.97902363 0.97952485 0.98001372
 0.9804915  0.98094764 0.98139149 0.98183101 0.98225909 0.98267925
 0.98309211 0.98349367 0.98388596 0.9842697  0.98464702 0.98500759
 0.9853662  0.98571796 0.98606055 0.98639524 0.98671785 0.98702781
 0.98733287 0.98763571 0.98792434 0.98820976 0.98848989 0.98876477
 0.98903187 0.98929128 0.98954449 0.9897898  0.9900335  0.99026972
 0.99049867 0.99072426 0.99094772 0.99116314 0.99137681 0.99157852
 0.99177871 0.99197369 0.99216772 0.99235471 0.99253568 0.99271308
 0.9928886  0.99305917 0.99322691 0.99338958 0.99354935 0.99370499
 0.99385807 0.99400991 0.99415972 0.99430355 0.99444088 0.99457538
 0.99470672 0.9948338  0.99495962 0.99508233 0.9951986  0.99531353
 0.9954249  0.99553312 0.99564029 0.99574661 0.99584891 0.9959496
 0.99604883 0.99614744 0.99624374 0.99633383 0.99642316 0.99651038
 0.996595   0.99667798 0.99675807 0.99683717 0.99691588 0.99699346
 0.99706887 0.99714348 0.99721575 0.9972871  0.99735716 0.99742536
 0.99749119 0.99755538 0.99761793 0.99767884 0.99773831 0.99779606
 0.99785238 0.99790843 0.99796372 0.99801698 0.9980687  0.99811993
 0.99817013 0.99821916 0.99826735 0.99831323 0.99835872 0.99840366
 0.9984471  0.99848902 0.99852964 0.99856961 0.99860929 0.99864789
 0.9986856  0.99872281 0.99875912 0.99879391 0.9988284  0.99886137
 0.99889398 0.99892592 0.99895708 0.99898724 0.99901664 0.99904558
 0.99907319 0.99910039 0.99912704 0.99915285 0.99917801 0.99920273
 0.99922651 0.99924991 0.99927263 0.99929408 0.9993151  0.99933589
 0.99935614 0.99937597 0.99939504 0.99941347 0.99943142 0.99944911
 0.99946621 0.99948313 0.99949923 0.99951505 0.99953068 0.99954599
 0.99956091 0.99957514 0.99958875 0.99960231 0.99961551 0.99962849
 0.99964122 0.9996535  0.99966547 0.99967728 0.99968861 0.99969975
 0.99971057 0.999721   0.99973087 0.99974056 0.99974979 0.99975901
 0.99976791 0.99977649 0.99978503 0.9997934  0.99980144 0.99980945
 0.99981709 0.99982467 0.99983178 0.99983868 0.99984523 0.99985171
 0.9998581  0.99986421 0.99987028 0.99987607 0.99988152 0.99988689
 0.99989216 0.99989737 0.99990236 0.99990714 0.99991176 0.99991615
 0.99992041 0.99992463 0.99992861 0.99993244 0.99993617 0.9999397
 0.99994312 0.99994642 0.9999496  0.99995254 0.99995543 0.99995811
 0.99996072 0.99996319 0.99996553 0.99996773 0.99996987 0.9999719
 0.99997389 0.99997586 0.99997771 0.99997949 0.99998118 0.99998281
 0.99998435 0.9999858  0.9999872  0.99998854 0.99998985 0.99999101
 0.99999214 0.9999932  0.9999942  0.9999951  0.99999595 0.99999671
 0.99999745 0.99999809 0.99999866 0.99999916 0.9999996  1.
 1.        ]
In [ ]:
def plot_cumulative_variance(df, df_name, threshold=0.99):
  X = df.drop('Accident Level', axis=1)

  # Scaling
  scaler = StandardScaler()
  X_scaled = scaler.fit_transform(X)

  # PCA
  pca = PCA()
  pca.fit(X_scaled)

  # Explained variance ratio and cumulative explained variance
  explained_variance_ratio = pca.explained_variance_ratio_
  cumulative_explained_variance = np.cumsum(explained_variance_ratio)

  # Find number of components for threshold
  n_components_at_threshold = np.argmax(cumulative_explained_variance >= threshold) + 1

  # Plotting
  plt.figure(figsize=(10, 5))
  plt.plot(np.arange(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance)
  plt.axhline(y=threshold, color='g', linestyle='--')
  plt.text(n_components_at_threshold, threshold, f"{n_components_at_threshold}", color='green')
  plt.title(f'Cumulative Explained Variance vs. Principal Components ({df_name})')
  plt.xlabel('Number of Principal Components')
  plt.ylabel('Cumulative Explained Variance')
  plt.grid(True)
  plt.show()

# Plot for each dataframe
plot_cumulative_variance(Final_NLP_Glove_df, 'Glove Embeddings')
plt.subplots_adjust(wspace=0.5)  # Add spacing between plots
plot_cumulative_variance(Final_NLP_TFIDF_df, 'TF-IDF Features')
plt.subplots_adjust(wspace=0.5)
plot_cumulative_variance(Final_NLP_Word2Vec_df, 'Word2Vec Embeddings')
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
In [ ]:
# Train and evaluate classifiers with PCA components

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
import time

# Initialize classifiers
classifiers = {
    "Logistic Regression": LogisticRegression(),
    "Support Vector Machine": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "XG Boost": XGBClassifier(),
    "Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier()
}

# Function to train and evaluate models (modified for PCA data)
def train_and_evaluate_pca(X_train, X_test, y_train, y_test):
    results = []
    for name, clf in classifiers.items():
        start_time = time.time()
        clf.fit(X_train, y_train)
        training_time = time.time() - start_time

        # Train metrics
        y_train_pred = clf.predict(X_train)
        train_accuracy = accuracy_score(y_train, y_train_pred)
        train_precision = precision_score(y_train, y_train_pred, average='weighted')
        train_recall = recall_score(y_train, y_train_pred, average='weighted')
        train_f1 = f1_score(y_train, y_train_pred, average='weighted')

        start_time = time.time()
        y_test_pred = clf.predict(X_test)
        prediction_time = time.time() - start_time

        # Test metrics
        test_accuracy = accuracy_score(y_test, y_test_pred)
        test_precision = precision_score(y_test, y_test_pred, average='weighted')
        test_recall = recall_score(y_test, y_test_pred, average='weighted')
        test_f1 = f1_score(y_test, y_test_pred, average='weighted')

        results.append([name,
                        train_accuracy, train_precision, train_recall, train_f1,
                        test_accuracy, test_precision, test_recall, test_f1,
                        training_time, prediction_time])

    return results

# Train and evaluate on each PCA-transformed dataset
glove_results_pca = train_and_evaluate_pca(X_train_glove, X_test_glove, y_train_glove, y_test_glove)
tfidf_results_pca = train_and_evaluate_pca(X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf)
word2vec_results_pca = train_and_evaluate_pca(X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec)

# Create DataFrames for results
columns = ['Classifier',
           'Train Accuracy', 'Train Precision', 'Train Recall', 'Train F1-score',
           'Test Accuracy', 'Test Precision', 'Test Recall', 'Test F1-score',
           'Training Time', 'Prediction Time']

glove_df_pca = pd.DataFrame(glove_results_pca, columns=columns)
tfidf_df_pca = pd.DataFrame(tfidf_results_pca, columns=columns)
word2vec_df_pca = pd.DataFrame(word2vec_results_pca, columns=columns)
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
In [ ]:
print("Classification matrix for Glove (PCA)")
glove_df_pca
Classification matrix for Glove (PCA)
Out[ ]:
Classifier Train Accuracy Train Precision Train Recall Train F1-score Test Accuracy Test Precision Test Recall Test F1-score Training Time Prediction Time
0 Logistic Regression 0.999191 0.999194 0.999191 0.999191 0.957929 0.960863 0.957929 0.958743 0.079819 0.000337
1 Support Vector Machine 0.993528 0.993566 0.993528 0.993528 0.970874 0.974431 0.970874 0.971469 0.169137 0.060451
2 Decision Tree 0.999191 0.999194 0.999191 0.999191 0.783172 0.785443 0.783172 0.784172 0.284969 0.000304
3 Random Forest 0.999191 0.999194 0.999191 0.999191 0.961165 0.965846 0.961165 0.961777 1.409135 0.006983
4 Gradient Boosting 0.999191 0.999194 0.999191 0.999191 0.970874 0.973720 0.970874 0.971270 53.083988 0.005734
5 XG Boost 0.999191 0.999194 0.999191 0.999191 0.964401 0.967890 0.964401 0.964860 5.160899 0.003043
6 Naive Bayes 0.893204 0.893928 0.893204 0.891509 0.825243 0.834932 0.825243 0.825270 0.003442 0.001647
7 K-Nearest Neighbors 0.844660 0.872620 0.844660 0.810599 0.847896 0.876497 0.847896 0.793662 0.000756 0.005989
In [ ]:
print("\nClassification matrix for TFIDF (PCA)")
tfidf_df_pca
Classification matrix for TFIDF (PCA)
Out[ ]:
Classifier Train Accuracy Train Precision Train Recall Train F1-score Test Accuracy Test Precision Test Recall Test F1-score Training Time Prediction Time
0 Logistic Regression 0.999191 0.999194 0.999191 0.999191 0.980583 0.981012 0.980583 0.980439 0.079465 0.000648
1 Support Vector Machine 0.990291 0.990401 0.990291 0.990269 0.983819 0.984272 0.983819 0.983866 0.212710 0.063525
2 Decision Tree 0.999191 0.999194 0.999191 0.999191 0.902913 0.903506 0.902913 0.902345 0.430676 0.000385
3 Random Forest 0.999191 0.999194 0.999191 0.999191 0.977346 0.977631 0.977346 0.977134 1.682344 0.006746
4 Gradient Boosting 0.999191 0.999194 0.999191 0.999191 0.983819 0.984421 0.983819 0.983826 96.555587 0.003929
5 XG Boost 0.999191 0.999194 0.999191 0.999191 0.980583 0.980778 0.980583 0.980454 5.955721 0.002765
6 Naive Bayes 0.792880 0.816618 0.792880 0.790568 0.757282 0.770023 0.757282 0.754101 0.005089 0.002329
7 K-Nearest Neighbors 0.816343 0.895096 0.816343 0.774285 0.844660 0.903556 0.844660 0.791681 0.000815 0.008310
In [ ]:
print("\nClassification matrix for Word2Vec (PCA)")
word2vec_df_pca
Classification matrix for Word2Vec (PCA)
Out[ ]:
Classifier Train Accuracy Train Precision Train Recall Train F1-score Test Accuracy Test Precision Test Recall Test F1-score Training Time Prediction Time
0 Logistic Regression 0.998382 0.998388 0.998382 0.998380 0.925566 0.927712 0.925566 0.926466 0.069828 0.000369
1 Support Vector Machine 0.977346 0.977419 0.977346 0.977364 0.906149 0.921392 0.906149 0.910200 0.127228 0.061526
2 Decision Tree 0.999191 0.999194 0.999191 0.999191 0.747573 0.738226 0.747573 0.739947 0.170091 0.000326
3 Random Forest 0.999191 0.999194 0.999191 0.999191 0.938511 0.954455 0.938511 0.941858 1.142696 0.006971
4 Gradient Boosting 0.999191 0.999194 0.999191 0.999191 0.944984 0.947230 0.944984 0.945470 39.508523 0.004593
5 XG Boost 0.999191 0.999194 0.999191 0.999191 0.935275 0.947812 0.935275 0.938181 3.478531 0.003237
6 Naive Bayes 0.875405 0.880381 0.875405 0.875763 0.818770 0.835890 0.818770 0.823406 0.002889 0.001141
7 K-Nearest Neighbors 0.872168 0.888543 0.872168 0.854383 0.873786 0.877957 0.873786 0.854452 0.000673 0.004610

GloVe Embedding with PCA:

Logistic Regression: The Test Accuracy slightly decreases with PCA, indicating a potential loss of information. SVM: Shows a minor drop in performance, but still maintains a high accuracy. Random Forest and XG Boost: Experience a reduction in overfitting, with more balanced training and test scores. KNN: Performance remains relatively stable, suggesting PCA's effectiveness in reducing dimensionality without significant information loss. TFIDF Features with PCA:

Logistic Regression: Maintains a high Test Accuracy, showing PCA's ability to retain essential features. SVM: Performance is consistent with and without PCA, indicating robustness to dimensionality reduction. Random Forest and XG Boost: Show improved generalization with PCA, reducing overfitting. KNN: Experiences a slight improvement in Test Accuracy, benefiting from reduced dimensionality. Word2Vec Embedding with PCA:

Logistic Regression: Performance improves with PCA, suggesting that dimensionality reduction helps in capturing essential features. SVM: Shows a significant improvement in Test Accuracy, indicating PCA's effectiveness in handling Word2Vec's high dimensionality. Random Forest and XG Boost: Experience a reduction in overfitting, with more balanced training and test scores. KNN: Performance remains stable, benefiting from PCA's dimensionality reduction. Insights and Comparison:

PCA's Impact: PCA generally helps in reducing overfitting, especially for complex models like Random Forest and XG Boost, by balancing training and test scores. Embedding Techniques: GloVe and TFIDF continue to perform well with PCA, while Word2Vec shows significant improvement, highlighting PCA's effectiveness in handling high-dimensional data. Model Robustness: Logistic Regression and SVM demonstrate robustness to PCA, maintaining high performance across different embeddings. Dimensionality Reduction: PCA proves beneficial in reducing dimensionality without significant information loss, particularly for Word2Vec, which inherently has high dimensionality.

In [ ]:
# Function to plot classification report and training/prediction times
def plot_results(df, title):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

    # Classification report heatmap
    report_data = df[['Classifier', 'Train Precision', 'Train Recall', 'Train F1-score',
                       'Test Precision', 'Test Recall', 'Test F1-score']].set_index('Classifier')
    sns.heatmap(report_data, annot=True, cmap='Purples', fmt='.2f', ax=ax1)
    ax1.set_title(f'Classifier Performance - {title}')

    # Training and prediction time comparison
    df.plot(x='Classifier', y=['Training Time', 'Prediction Time'], kind='bar', ax=ax2, cmap='Set3')
    ax2.set_title(f'Training and Prediction Time - {title}')
    ax2.set_ylabel('Time (seconds)')
    plt.tight_layout()
    plt.show()

# Plot results for each DataFrame (with PCA)
plot_results(glove_df_pca, 'Glove Embeddings (PCA)')
plot_results(tfidf_df_pca, 'TF-IDF Embeddings (PCA)')
plot_results(word2vec_df_pca, 'Word2Vec Embeddings (PCA)')
In [ ]:
# Function to plot confusion matrix against all classifiers with word embeddings generated using Glove, TF-IDF, Word2Vec with PCA

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

def plot_confusion_matrices_pca(X_train, X_test, y_train, y_test, df_name):
  fig, axes = plt.subplots(2, 4, figsize=(20, 10))
  fig.suptitle(f'Confusion Matrices for {df_name} (PCA)', fontsize=16)

  for i, (name, clf) in enumerate(classifiers.items()):
    row = i // 4
    col = i % 4
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
    disp.plot(ax=axes[row, col], cmap='Purples')
    axes[row, col].set_title(name)

  plt.tight_layout()
  plt.show()
In [ ]:
plot_confusion_matrices_pca(X_train_glove, X_test_glove, y_train_glove, y_test_glove, 'Glove Embeddings')
In [ ]:
plot_confusion_matrices_pca(X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf, 'TF-IDF Features')
In [ ]:
plot_confusion_matrices_pca(X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec, 'Word2Vec Embeddings')
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Confusion Matrix Observations: (Base Classifiers + PCA) Overall Performance:

PCA generally improved the performance of simpler models like Logistic Regression and SVM across all embeddings. Random Forest and XGBoost maintain strong performance, similar to non-PCA results. Glove Embeddings with PCA:

Improved performance for Logistic Regression and SVM compared to non-PCA Glove embeddings. K-Nearest Neighbors shows better classification, especially for class 0. TF-IDF Features with PCA:

Slight improvements across most classifiers compared to non-PCA TF-IDF. Naive Bayes shows notable improvement, especially for classes 1 and 2. Word2Vec Embeddings with PCA:

Significant improvement for Logistic Regression and SVM compared to non-PCA Word2Vec. K-Nearest Neighbors and Naive Bayes still struggle but show some improvement. Class-specific observations:

Class 4 remains well-classified across all embeddings and classifiers. PCA helped reduce misclassifications between middle classes (1, 2, 3) for most models. Model Complexity:

PCA narrowed the performance gap between simpler and more complex models. Embedding Effectiveness with PCA:

Word2Vec embeddings benefited the most from PCA, showing substantial improvements. TF-IDF features with PCA provide the most consistent performance across classifiers. Conclusion

Applying PCA generally improved model performance, especially for simpler models and Word2Vec embeddings. It helped in reducing the dimensionality of the data while preserving important features, leading to better classification results.

Train vs Test Confusion Matrices for all ML classifiers with PCA

In [ ]:
def plot_train_test_confusion_matrices_pca(X_train, X_test, y_train, y_test, df_name):
    fig, axes = plt.subplots(8, 2, figsize=(20, 40))
    fig.suptitle(f'Train and Test Confusion Matrices for {df_name} (PCA)', fontsize=15, y=0.98)

    for i, (name, clf) in enumerate(classifiers.items()):
        clf.fit(X_train, y_train)

        # Train confusion matrix
        y_train_pred = clf.predict(X_train)
        cm_train = confusion_matrix(y_train, y_train_pred)
        disp_train = ConfusionMatrixDisplay(confusion_matrix=cm_train, display_labels=clf.classes_)
        disp_train.plot(ax=axes[i, 0], cmap='Purples')
        axes[i, 0].set_title(f'{name} (Train)', fontsize=12)

        # Test confusion matrix
        y_test_pred = clf.predict(X_test)
        cm_test = confusion_matrix(y_test, y_test_pred)
        disp_test = ConfusionMatrixDisplay(confusion_matrix=cm_test, display_labels=clf.classes_)
        disp_test.plot(ax=axes[i, 1], cmap='Purples')
        axes[i, 1].set_title(f'{name} (Test)', fontsize=12)

    plt.tight_layout(rect=[0, 0, 1, 0.96])
    plt.show()
In [ ]:
plot_train_test_confusion_matrices_pca(X_train_glove, X_test_glove, y_train_glove, y_test_glove, 'Glove Embeddings')
In [ ]:
plot_train_test_confusion_matrices_pca(X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf, 'TF-IDF Features')
In [ ]:
plot_train_test_confusion_matrices_pca(X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec, 'Word2Vec Embeddings')
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Base ML Classifiers + Hypertuning

In [ ]:
# Applying Hypertuning to all the classifers and run without PCA

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
import time

# Prepare data
X_glove = Final_NLP_Glove_df.drop('Accident Level', axis=1)
y_glove = Final_NLP_Glove_df['Accident Level']
X_tfidf = Final_NLP_TFIDF_df.drop('Accident Level', axis=1)
y_tfidf = Final_NLP_TFIDF_df['Accident Level']
X_word2vec = Final_NLP_Word2Vec_df.drop('Accident Level', axis=1)
y_word2vec = Final_NLP_Word2Vec_df['Accident Level']

# Split data
X_train_glove, X_test_glove, y_train_glove, y_test_glove = train_test_split(X_glove, y_glove, test_size=0.2, random_state=42)
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(X_tfidf, y_tfidf, test_size=0.2, random_state=42)
X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec = train_test_split(X_word2vec, y_word2vec, test_size=0.2, random_state=42)

# Define classifiers and hyperparameter grids
classifiers = {
    "Logistic Regression": (LogisticRegression(), {
        'penalty': ['l1', 'l2'],
        'C': [0.01, 0.1, 1, 10],
        'solver': ['liblinear', 'saga'],
        'max_iter': [100, 500, 1000]
    }),
    "Support Vector Machine": (SVC(), {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf', 'poly'],
        'gamma': ['scale', 'auto'],
        'class_weight': ['balanced', None],
        'max_iter': [1000, 5000, 10000]
    }),
    "Decision Tree": (DecisionTreeClassifier(), {
        'criterion': ['gini', 'entropy'],
        'max_depth': [None, 5, 10, 20],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    }),
    "Random Forest": (RandomForestClassifier(), {
        'n_estimators': [50, 100, 200],
        'criterion': ['gini', 'entropy'],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'max_features': ['auto', 'sqrt']
    }),
    "Gradient Boosting": (GradientBoostingClassifier(), {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.1, 0.2],
        'max_depth': [3, 5, 7],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'n_iter_no_change': [5],
        'validation_fraction': [0.1, 0.2]
    }),
    "XG Boost": (XGBClassifier(), {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.1, 0.2],
        'max_depth': [3, 5, 7],
        'subsample': [0.8, 0.9, 1.0],
        'colsample_bytree': [0.8, 0.9, 1.0]
    }),
    "Naive Bayes": (GaussianNB(), {}),  # No hyperparameters for GaussianNB
    "K-Nearest Neighbors": (KNeighborsClassifier(), {
        'n_neighbors': [3, 5, 7, 9],
        'weights': ['uniform', 'distance'],
        'p': [1, 2]
    })
}

# Scoring metrics
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score, average='weighted'),
    'recall': make_scorer(recall_score, average='weighted'),
    'f1': make_scorer(f1_score, average='weighted')
}

# Function to perform hyperparameter tuning and evaluation
def tune_and_evaluate(X_train, X_test, y_train, y_test, embedding_name):
    results = []
    for name, (clf, param_grid) in classifiers.items():
        start_time = time.time()
        # Use RandomizedSearchCV for efficiency with large param grids
        grid_search = RandomizedSearchCV(clf, param_grid, cv=5, scoring=scoring, refit='f1', n_jobs=-1, verbose=2, random_state=42)
        grid_search.fit(X_train, y_train)
        training_time = time.time() - start_time

        best_clf = grid_search.best_estimator_

        # Train metrics (using best estimator)
        y_train_pred = best_clf.predict(X_train)
        train_accuracy = accuracy_score(y_train, y_train_pred)
        train_precision = precision_score(y_train, y_train_pred, average='weighted')
        train_recall = recall_score(y_train, y_train_pred, average='weighted')
        train_f1 = f1_score(y_train, y_train_pred, average='weighted')

        start_time = time.time()
        y_test_pred = best_clf.predict(X_test)
        prediction_time = time.time() - start_time

        # Test metrics
        test_accuracy = accuracy_score(y_test, y_test_pred)
        test_precision = precision_score(y_test, y_test_pred, average='weighted')
        test_recall = recall_score(y_test, y_test_pred, average='weighted')
        test_f1 = f1_score(y_test, y_test_pred, average='weighted')

        results.append([name,
                        train_accuracy, train_precision, train_recall, train_f1,
                        test_accuracy, test_precision, test_recall, test_f1,
                        training_time, prediction_time, grid_search.best_params_])

    # Create DataFrame and print results
    columns = ['Classifier',
               'Train Accuracy', 'Train Precision', 'Train Recall', 'Train F1-score',
               'Test Accuracy', 'Test Precision', 'Test Recall', 'Test F1-score',
               'Training Time', 'Prediction Time', 'Best Parameters']
    df = pd.DataFrame(results, columns=columns)
    print(f"----- Results for {embedding_name} -----")
    print(df)
    return df

# Tune and evaluate for each embedding
glove_results = tune_and_evaluate(X_train_glove, X_test_glove, y_train_glove, y_test_glove, "Glove")
tfidf_results = tune_and_evaluate(X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf, "TF-IDF")
word2vec_results = tune_and_evaluate(X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec, "Word2Vec")
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
/usr/local/lib/python3.10/dist-packages/sklearn/svm/_base.py:297: ConvergenceWarning: Solver terminated early (max_iter=10000).  Consider pre-processing your data with StandardScaler or MinMaxScaler.
  warnings.warn(
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py:540: FitFailedWarning: 
30 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
24 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 1466, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'max_features' parameter of RandomForestClassifier must be an int in the range [1, inf), a float in the range (0.0, 1.0], a str among {'sqrt', 'log2'} or None. Got 'auto' instead.

--------------------------------------------------------------------------------
6 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 1466, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'max_features' parameter of RandomForestClassifier must be an int in the range [1, inf), a float in the range (0.0, 1.0], a str among {'log2', 'sqrt'} or None. Got 'auto' instead.

  warnings.warn(some_fits_failed_message, FitFailedWarning)
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:1103: UserWarning: One or more of the test scores are non-finite: [       nan        nan 0.98463497        nan        nan        nan
        nan 0.96844391 0.96601149 0.96035654]
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:1103: UserWarning: One or more of the test scores are non-finite: [       nan        nan 0.98509669        nan        nan        nan
        nan 0.96887328 0.96614676 0.96086609]
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:1103: UserWarning: One or more of the test scores are non-finite: [       nan        nan 0.98465107        nan        nan        nan
        nan 0.96840099 0.96588712 0.95979085]
  warnings.warn(
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:320: UserWarning: The total space of parameters 1 is smaller than n_iter=10. Running 1 iterations. For exhaustive searches, use GridSearchCV.
  warnings.warn(
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
----- Results for Glove -----
               Classifier  Train Accuracy  Train Precision  Train Recall  \
0     Logistic Regression        0.999191         0.999194      0.999191   
1  Support Vector Machine        0.998382         0.998388      0.998382   
2           Decision Tree        0.991909         0.991977      0.991909   
3           Random Forest        0.999191         0.999194      0.999191   
4       Gradient Boosting        0.998382         0.998388      0.998382   
5                XG Boost        0.999191         0.999194      0.999191   
6             Naive Bayes        0.684466         0.726348      0.684466   
7     K-Nearest Neighbors        0.999191         0.999194      0.999191   

   Train F1-score  Test Accuracy  Test Precision  Test Recall  Test F1-score  \
0        0.999191       0.948220        0.955938     0.948220       0.949985   
1        0.998380       0.941748        0.943732     0.941748       0.942403   
2        0.991911       0.831715        0.828847     0.831715       0.828860   
3        0.999191       0.987055        0.987368     0.987055       0.987074   
4        0.998380       0.987055        0.987234     0.987055       0.987067   
5        0.999191       0.987055        0.987238     0.987055       0.987073   
6        0.669862       0.679612        0.703977     0.679612       0.669049   
7        0.999191       0.873786        0.880271     0.873786       0.840505   

   Training Time  Prediction Time  \
0      76.707281         0.006634   
1      13.944002         0.023417   
2      11.173215         0.004765   
3      38.102865         0.015671   
4    2019.551321         0.010656   
5     696.005199         0.127318   
6       0.307796         0.007908   
7       3.051227         0.174942   

                                     Best Parameters  
0  {'solver': 'liblinear', 'penalty': 'l1', 'max_...  
1  {'max_iter': 10000, 'kernel': 'linear', 'gamma...  
2  {'min_samples_split': 2, 'min_samples_leaf': 1...  
3  {'n_estimators': 200, 'min_samples_split': 2, ...  
4  {'validation_fraction': 0.1, 'n_iter_no_change...  
5  {'subsample': 0.9, 'n_estimators': 200, 'max_d...  
6                                                 {}  
7  {'weights': 'distance', 'p': 1, 'n_neighbors': 3}  
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
/usr/local/lib/python3.10/dist-packages/sklearn/svm/_base.py:297: ConvergenceWarning: Solver terminated early (max_iter=10000).  Consider pre-processing your data with StandardScaler or MinMaxScaler.
  warnings.warn(
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py:540: FitFailedWarning: 
30 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
7 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 1466, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'max_features' parameter of RandomForestClassifier must be an int in the range [1, inf), a float in the range (0.0, 1.0], a str among {'sqrt', 'log2'} or None. Got 'auto' instead.

--------------------------------------------------------------------------------
23 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 1466, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'max_features' parameter of RandomForestClassifier must be an int in the range [1, inf), a float in the range (0.0, 1.0], a str among {'log2', 'sqrt'} or None. Got 'auto' instead.

  warnings.warn(some_fits_failed_message, FitFailedWarning)
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:1103: UserWarning: One or more of the test scores are non-finite: [       nan        nan 0.97168604        nan        nan        nan
        nan 0.97250555 0.9538886  0.96359867]
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:1103: UserWarning: One or more of the test scores are non-finite: [       nan        nan 0.97428768        nan        nan        nan
        nan 0.9753402  0.96094194 0.96835105]
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:1103: UserWarning: One or more of the test scores are non-finite: [       nan        nan 0.97205222        nan        nan        nan
        nan 0.97287648 0.95468864 0.96414425]
  warnings.warn(
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:320: UserWarning: The total space of parameters 1 is smaller than n_iter=10. Running 1 iterations. For exhaustive searches, use GridSearchCV.
  warnings.warn(
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
----- Results for TF-IDF -----
               Classifier  Train Accuracy  Train Precision  Train Recall  \
0     Logistic Regression        0.998382         0.998388      0.998382   
1  Support Vector Machine        0.998382         0.998388      0.998382   
2           Decision Tree        0.999191         0.999194      0.999191   
3           Random Forest        0.999191         0.999194      0.999191   
4       Gradient Boosting        0.996764         0.996779      0.996764   
5                XG Boost        0.999191         0.999194      0.999191   
6             Naive Bayes        0.999191         0.999194      0.999191   
7     K-Nearest Neighbors        0.944175         0.948868      0.944175   

   Train F1-score  Test Accuracy  Test Precision  Test Recall  Test F1-score  \
0        0.998380       0.948220        0.956991     0.948220       0.950274   
1        0.998380       0.957929        0.967044     0.957929       0.959706   
2        0.999191       0.893204        0.899858     0.893204       0.895294   
3        0.999191       0.974110        0.976396     0.974110       0.974348   
4        0.996767       0.922330        0.934634     0.922330       0.925375   
5        0.999191       0.944984        0.951702     0.944984       0.946672   
6        0.999191       0.957929        0.965736     0.957929       0.959332   
7        0.941326       0.925566        0.933863     0.925566       0.916946   

   Training Time  Prediction Time  \
0     453.165277         0.023494   
1      71.916458         0.185678   
2       5.997940         0.013414   
3      12.794120         0.019786   
4     842.220605         0.030042   
5     525.890726         1.333944   
6       0.779305         0.026031   
7      18.025071         1.208461   

                                     Best Parameters  
0  {'solver': 'liblinear', 'penalty': 'l1', 'max_...  
1  {'max_iter': 10000, 'kernel': 'linear', 'gamma...  
2  {'min_samples_split': 2, 'min_samples_leaf': 1...  
3  {'n_estimators': 100, 'min_samples_split': 10,...  
4  {'validation_fraction': 0.1, 'n_iter_no_change...  
5  {'subsample': 0.9, 'n_estimators': 100, 'max_d...  
6                                                 {}  
7   {'weights': 'uniform', 'p': 1, 'n_neighbors': 3}  
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
/usr/local/lib/python3.10/dist-packages/sklearn/svm/_base.py:297: ConvergenceWarning: Solver terminated early (max_iter=5000).  Consider pre-processing your data with StandardScaler or MinMaxScaler.
  warnings.warn(
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py:540: FitFailedWarning: 
30 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
23 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 1466, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'max_features' parameter of RandomForestClassifier must be an int in the range [1, inf), a float in the range (0.0, 1.0], a str among {'log2', 'sqrt'} or None. Got 'auto' instead.

--------------------------------------------------------------------------------
7 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 1466, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'max_features' parameter of RandomForestClassifier must be an int in the range [1, inf), a float in the range (0.0, 1.0], a str among {'sqrt', 'log2'} or None. Got 'auto' instead.

  warnings.warn(some_fits_failed_message, FitFailedWarning)
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:1103: UserWarning: One or more of the test scores are non-finite: [       nan        nan 0.92717775        nan        nan        nan
        nan 0.90774781 0.89155022 0.9101639 ]
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:1103: UserWarning: One or more of the test scores are non-finite: [       nan        nan 0.92904028        nan        nan        nan
        nan 0.90974073 0.89354782 0.91194153]
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:1103: UserWarning: One or more of the test scores are non-finite: [       nan        nan 0.92595484        nan        nan        nan
        nan 0.90624412 0.88896348 0.90844741]
  warnings.warn(
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:320: UserWarning: The total space of parameters 1 is smaller than n_iter=10. Running 1 iterations. For exhaustive searches, use GridSearchCV.
  warnings.warn(
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
----- Results for Word2Vec -----
               Classifier  Train Accuracy  Train Precision  Train Recall  \
0     Logistic Regression        0.715210         0.711922      0.715210   
1  Support Vector Machine        0.641586         0.639280      0.641586   
2           Decision Tree        0.999191         0.999194      0.999191   
3           Random Forest        0.999191         0.999194      0.999191   
4       Gradient Boosting        0.994337         0.994331      0.994337   
5                XG Boost        0.999191         0.999194      0.999191   
6             Naive Bayes        0.493528         0.544607      0.493528   
7     K-Nearest Neighbors        0.999191         0.999194      0.999191   

   Train F1-score  Test Accuracy  Test Precision  Test Recall  Test F1-score  \
0        0.705289       0.592233        0.585553     0.592233       0.571221   
1        0.626441       0.566343        0.575095     0.566343       0.550967   
2        0.999191       0.799353        0.793136     0.799353       0.795446   
3        0.999191       0.964401        0.964491     0.964401       0.964154   
4        0.994324       0.957929        0.958831     0.957929       0.958200   
5        0.999191       0.980583        0.980724     0.980583       0.980494   
6        0.477154       0.466019        0.469953     0.466019       0.425129   
7        0.999191       0.779935        0.776563     0.779935       0.766324   

   Training Time  Prediction Time  \
0      67.198934         0.004518   
1      15.111777         0.053201   
2      15.336766         0.002777   
3      38.511236         0.024181   
4    1787.307096         0.009046   
5     666.640162         0.075200   
6       0.219516         0.005002   
7       3.601967         0.206428   

                                     Best Parameters  
0  {'solver': 'liblinear', 'penalty': 'l1', 'max_...  
1  {'max_iter': 5000, 'kernel': 'linear', 'gamma'...  
2  {'min_samples_split': 2, 'min_samples_leaf': 1...  
3  {'n_estimators': 200, 'min_samples_split': 2, ...  
4  {'validation_fraction': 0.1, 'n_iter_no_change...  
5  {'subsample': 1.0, 'n_estimators': 200, 'max_d...  
6                                                 {}  
7  {'weights': 'distance', 'p': 1, 'n_neighbors': 3}  
In [ ]:
print("Glove Results")
display(glove_results)
Glove Results
Classifier Train Accuracy Train Precision Train Recall Train F1-score Test Accuracy Test Precision Test Recall Test F1-score Training Time Prediction Time Best Parameters
0 Logistic Regression 0.999191 0.999194 0.999191 0.999191 0.948220 0.955938 0.948220 0.949985 76.707281 0.006634 {'solver': 'liblinear', 'penalty': 'l1', 'max_...
1 Support Vector Machine 0.998382 0.998388 0.998382 0.998380 0.941748 0.943732 0.941748 0.942403 13.944002 0.023417 {'max_iter': 10000, 'kernel': 'linear', 'gamma...
2 Decision Tree 0.991909 0.991977 0.991909 0.991911 0.831715 0.828847 0.831715 0.828860 11.173215 0.004765 {'min_samples_split': 2, 'min_samples_leaf': 1...
3 Random Forest 0.999191 0.999194 0.999191 0.999191 0.987055 0.987368 0.987055 0.987074 38.102865 0.015671 {'n_estimators': 200, 'min_samples_split': 2, ...
4 Gradient Boosting 0.998382 0.998388 0.998382 0.998380 0.987055 0.987234 0.987055 0.987067 2019.551321 0.010656 {'validation_fraction': 0.1, 'n_iter_no_change...
5 XG Boost 0.999191 0.999194 0.999191 0.999191 0.987055 0.987238 0.987055 0.987073 696.005199 0.127318 {'subsample': 0.9, 'n_estimators': 200, 'max_d...
6 Naive Bayes 0.684466 0.726348 0.684466 0.669862 0.679612 0.703977 0.679612 0.669049 0.307796 0.007908 {}
7 K-Nearest Neighbors 0.999191 0.999194 0.999191 0.999191 0.873786 0.880271 0.873786 0.840505 3.051227 0.174942 {'weights': 'distance', 'p': 1, 'n_neighbors': 3}
In [ ]:
print("TF-IDF Results")
display(tfidf_results)
TF-IDF Results
Classifier Train Accuracy Train Precision Train Recall Train F1-score Test Accuracy Test Precision Test Recall Test F1-score Training Time Prediction Time Best Parameters
0 Logistic Regression 0.998382 0.998388 0.998382 0.998380 0.948220 0.956991 0.948220 0.950274 453.165277 0.023494 {'solver': 'liblinear', 'penalty': 'l1', 'max_...
1 Support Vector Machine 0.998382 0.998388 0.998382 0.998380 0.957929 0.967044 0.957929 0.959706 71.916458 0.185678 {'max_iter': 10000, 'kernel': 'linear', 'gamma...
2 Decision Tree 0.999191 0.999194 0.999191 0.999191 0.893204 0.899858 0.893204 0.895294 5.997940 0.013414 {'min_samples_split': 2, 'min_samples_leaf': 1...
3 Random Forest 0.999191 0.999194 0.999191 0.999191 0.974110 0.976396 0.974110 0.974348 12.794120 0.019786 {'n_estimators': 100, 'min_samples_split': 10,...
4 Gradient Boosting 0.996764 0.996779 0.996764 0.996767 0.922330 0.934634 0.922330 0.925375 842.220605 0.030042 {'validation_fraction': 0.1, 'n_iter_no_change...
5 XG Boost 0.999191 0.999194 0.999191 0.999191 0.944984 0.951702 0.944984 0.946672 525.890726 1.333944 {'subsample': 0.9, 'n_estimators': 100, 'max_d...
6 Naive Bayes 0.999191 0.999194 0.999191 0.999191 0.957929 0.965736 0.957929 0.959332 0.779305 0.026031 {}
7 K-Nearest Neighbors 0.944175 0.948868 0.944175 0.941326 0.925566 0.933863 0.925566 0.916946 18.025071 1.208461 {'weights': 'uniform', 'p': 1, 'n_neighbors': 3}
In [ ]:
print("Word2Vec Results")
display(word2vec_results)
Word2Vec Results
Classifier Train Accuracy Train Precision Train Recall Train F1-score Test Accuracy Test Precision Test Recall Test F1-score Training Time Prediction Time Best Parameters
0 Logistic Regression 0.715210 0.711922 0.715210 0.705289 0.592233 0.585553 0.592233 0.571221 67.198934 0.004518 {'solver': 'liblinear', 'penalty': 'l1', 'max_...
1 Support Vector Machine 0.641586 0.639280 0.641586 0.626441 0.566343 0.575095 0.566343 0.550967 15.111777 0.053201 {'max_iter': 5000, 'kernel': 'linear', 'gamma'...
2 Decision Tree 0.999191 0.999194 0.999191 0.999191 0.799353 0.793136 0.799353 0.795446 15.336766 0.002777 {'min_samples_split': 2, 'min_samples_leaf': 1...
3 Random Forest 0.999191 0.999194 0.999191 0.999191 0.964401 0.964491 0.964401 0.964154 38.511236 0.024181 {'n_estimators': 200, 'min_samples_split': 2, ...
4 Gradient Boosting 0.994337 0.994331 0.994337 0.994324 0.957929 0.958831 0.957929 0.958200 1787.307096 0.009046 {'validation_fraction': 0.1, 'n_iter_no_change...
5 XG Boost 0.999191 0.999194 0.999191 0.999191 0.980583 0.980724 0.980583 0.980494 666.640162 0.075200 {'subsample': 1.0, 'n_estimators': 200, 'max_d...
6 Naive Bayes 0.493528 0.544607 0.493528 0.477154 0.466019 0.469953 0.466019 0.425129 0.219516 0.005002 {}
7 K-Nearest Neighbors 0.999191 0.999194 0.999191 0.999191 0.779935 0.776563 0.779935 0.766324 3.601967 0.206428 {'weights': 'distance', 'p': 1, 'n_neighbors': 3}

GloVe Embedding with Hypertuning:

Logistic Regression: Hypertuning improves Test Accuracy and F1-score, indicating better generalization. SVM: Shows significant improvement in Test Accuracy and Precision, benefiting from hyperparameter optimization. Random Forest and XG Boost: Experience a reduction in overfitting, with more balanced training and test scores after hypertuning. KNN: Performance improves with hypertuning, achieving higher Test Accuracy and F1-score. TFIDF Features with Hypertuning:

Logistic Regression: Hypertuning maintains high Test Accuracy, showing robustness to parameter changes. SVM: Performance improves significantly, with higher Test Precision and Recall. Random Forest and XG Boost: Show improved generalization with hypertuning, reducing overfitting. KNN: Experiences a noticeable improvement in Test Accuracy and F1-score, benefiting from optimized parameters. Word2Vec Embedding with Hypertuning:

Logistic Regression: Performance improves with hypertuning, achieving higher Test Accuracy and F1-score. SVM: Shows a significant improvement in Test Accuracy and Precision, indicating effective hyperparameter tuning. Random Forest and XG Boost: Experience a reduction in overfitting, with more balanced training and test scores after hypertuning. KNN: Performance remains stable, benefiting from optimized parameters. Insights and Comparison:

Hypertuning's Impact: Hypertuning generally improves model performance, particularly for complex models like SVM, Random Forest, and XG Boost, by optimizing hyperparameters for better generalization. Embedding Techniques: All three embeddings benefit from hypertuning, with Word2Vec showing the most significant improvement, highlighting the importance of parameter optimization for high-dimensional data. Model Robustness: Logistic Regression and SVM demonstrate robustness to hypertuning, maintaining high performance across different embeddings. Overfitting Reduction: Hypertuning helps in reducing overfitting, especially for models like Random Forest and XG Boost, by balancing training and test scores. This comparison underscores the importance of hyperparameter tuning in enhancing model performance and generalization, particularly for complex models and high-dimensional embeddings like Word2Vec.

In [ ]:
# Function to plot classification report for all the ML classifers with Hypertuning and training/prediction times
def plot_results(df, title):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

    # Classification report heatmap
    report_data = df[['Classifier', 'Train Precision', 'Train Recall', 'Train F1-score',
                       'Test Precision', 'Test Recall', 'Test F1-score']].set_index('Classifier')
    sns.heatmap(report_data, annot=True, cmap='Blues', fmt='.2f', ax=ax1)
    ax1.set_title(f'Classifier Performance - {title}')

    # Training and prediction time comparison
    df.plot(x='Classifier', y=['Training Time', 'Prediction Time'], kind='bar', ax=ax2, cmap='Set3')
    ax2.set_title(f'Training and Prediction Time - {title}')
    ax2.set_ylabel('Time (seconds)')
    plt.tight_layout()
    plt.show()

# Plot results for each DataFrame (with hyperparameter tuning)
plot_results(glove_results, 'Glove Embeddings (Hyperparameter Tuning)')
plot_results(tfidf_results, 'TF-IDF Embeddings (Hyperparameter Tuning)')
plot_results(word2vec_results, 'Word2Vec Embeddings (Hyperparameter Tuning)')
In [ ]:
# Function to plot confusion matrix against all classifiers with word embeddings generated using Glove, TF-IDF, Word2Vec alongwith Hypertuning without PCA

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

def plot_train_test_confusion_matrices_ht_no_pca(X_train, X_test, y_train, y_test, df_name):
  fig, axes = plt.subplots(2, 4, figsize=(20, 10))
  fig.suptitle(f'Confusion Matrices for {df_name} (No PCA)', fontsize=16)

  for i, (name, (clf, _)) in enumerate(classifiers.items()):
    row = i // 4
    col = i % 4
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
    disp.plot(ax=axes[row, col], cmap='Blues')
    axes[row, col].set_title(name)

  plt.tight_layout()
  plt.show()
In [ ]:
plot_train_test_confusion_matrices_ht_no_pca(X_train_glove, X_test_glove, y_train_glove, y_test_glove, 'Glove Embeddings')
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
In [ ]:
plot_train_test_confusion_matrices_ht_no_pca(X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf, 'TF-IDF Features')
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
In [ ]:
plot_train_test_confusion_matrices_ht_no_pca(X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec, 'Word2Vec Embeddings')
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Confusion Matrix Observations: (Base Classifiers + Hypertuning) Overall Observations:

Hyperparameter tuning generally improves the performance of all classifiers across different embeddings. Ensemble methods like Random Forest, Gradient Boosting, and XG Boost consistently show top performance, indicating their robustness and effectiveness in handling various types of embeddings. Logistic Regression and SVM are very effective in binary-like class separations (e.g., classes 0 and 4) but sometimes struggle with middle classes. Naive Bayes and K-Nearest Neighbors generally show lower performance compared to more complex models, suggesting that these might require more specific tuning or might be less suitable for this particular dataset. Glove Embeddings with Hypertuning:

Logistic Regression and SVM again perform well, with high accuracy in predicting classes 0 and 4. Gradient Boosting and XG Boost show very strong performance, with Gradient Boosting slightly outperforming XG Boost in class 2. Decision Tree shows variability in performance, particularly struggling with class 2. Naive Bayes and K-Nearest Neighbors have higher misclassification rates compared to other classifiers. TF-IDF Features with Hypertuning:

Logistic Regression, SVM, and Random Forest show very high accuracy, particularly in classes 0 and 4. Gradient Boosting and XG Boost are highly effective, with nearly perfect classification in several classes. Decision Tree shows improved performance but still has some difficulty with class 2. Naive Bayes performs well in class 1 but has some issues in other classes. K-Nearest Neighbors shows decent performance but is not as effective as other classifiers. Word2Vec Embeddings with Hypertuning:

Logistic Regression and Support Vector Machine (SVM) show strong performance, particularly in correctly predicting classes 0 and 4. Decision Tree and Naive Bayes exhibit more misclassifications, especially in the middle classes (1, 2, 3). Random Forest and XG Boost demonstrate excellent accuracy, with very few misclassifications across all classes. K-Nearest Neighbors shows improved performance but still struggles with some classes compared to ensemble methods. Comparison with Non-Hyperparameter Tuned Models:

Hyperparameter tuning has notably enhanced the accuracy and reduced misclassifications across almost all classifiers and embeddings. The improvement is particularly evident in models that initially showed moderate performance, such as K-Nearest Neighbors and Decision Tree. The gap between simpler models and complex ensemble models has narrowed, but ensemble models still generally lead in performance. This analysis indicates that hyperparameter tuning is crucial for optimizing model performance, especially when dealing with diverse embeddings and complex classification tasks.

Train vs Test Confusion Matrices for all ML classifiers with Hypertuning

In [ ]:
def plot_train_test_confusion_matrices_ht(X_train, X_test, y_train, y_test, df_name):
    fig, axes = plt.subplots(8, 2, figsize=(20, 40))
    fig.suptitle(f'Train and Test Confusion Matrices for {df_name} (Hyperparameter Tuning)', fontsize=15, y=0.98)

    for i, (name, (clf, _)) in enumerate(classifiers.items()):
        clf.fit(X_train, y_train)

        # Train confusion matrix
        y_train_pred = clf.predict(X_train)
        cm_train = confusion_matrix(y_train, y_train_pred)
        disp_train = ConfusionMatrixDisplay(confusion_matrix=cm_train, display_labels=clf.classes_)
        disp_train.plot(ax=axes[i, 0], cmap='Blues')
        axes[i, 0].set_title(f'{name} (Train)', fontsize=12)

        # Test confusion matrix
        y_test_pred = clf.predict(X_test)
        cm_test = confusion_matrix(y_test, y_test_pred)
        disp_test = ConfusionMatrixDisplay(confusion_matrix=cm_test, display_labels=clf.classes_)
        disp_test.plot(ax=axes[i, 1], cmap='Blues')
        axes[i, 1].set_title(f'{name} (Test)', fontsize=12)

    plt.tight_layout(rect=[0, 0, 1, 0.96])
    plt.show()
In [ ]:
plot_train_test_confusion_matrices_ht(X_train_glove, X_test_glove, y_train_glove, y_test_glove, 'Glove Embeddings')
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
In [ ]:
plot_train_test_confusion_matrices_ht(X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf, 'TF-IDF Features')
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
In [ ]:
plot_train_test_confusion_matrices_ht(X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec, 'Word2Vec Embeddings')
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Overall Observations and Insights:¶

Overall Performance Improvement:

  1. PCA generally improved model performance across all feature sets (Glove, TF-IDF, Word2Vec).
  2. Hypertuning with PCA further enhanced performance for most models.

Consistent Top Performers:

  1. Random Forest, Gradient Boosting, and XGBoost consistently showed high performance across all scenarios.
  2. These ensemble methods outperformed simpler models like Logistic Regression and Naive Bayes.

Feature Set Comparison:

  1. Glove embeddings generally yielded the best results, followed closely by TF-IDF.
  2. Word2Vec performed slightly worse than the other two feature sets.

Impact of PCA:

  1. PCA significantly improved the performance of simpler models like Logistic Regression and Support Vector Machine.
  2. It also reduced training and prediction times for most models.

Hypertuning Benefits:

  1. Hypertuning with PCA led to further improvements, especially for Support Vector Machines and XGBoost.

Trade-offs:

  1. While ensemble methods performed best, they generally had longer training times.
  2. Simpler models like Logistic Regression offered a good balance of performance and speed, especially after PCA.

Recommendations:

  1. Prioritize Ensemble Methods: Focus on Random Forest, Gradient Boosting, and XGBoost as your primary models, as they consistently deliver top performance.
  2. Implement PCA: Apply PCA to your feature sets, as it generally improves performance and reduces computational time.
  3. Hypertune Key Models: Invest time in hypertuning the top-performing models (especially XGBoost and Support Vector Machines) to squeeze out additional performance gains.
  4. Consider Glove Embeddings: Prioritize using Glove embeddings as your primary feature set, with TF-IDF as a strong alternativeApply PCA to your feature sets, as it generally improves performance and reduces computational time.
  5. Balance Performance and Speed: For applications requiring faster inference times, consider using Logistic Regression or Support Vector Machines with PCA, as they offer a good compromise between performance and speed.
  6. Ensemble Approach: Consider creating an ensemble of your top-performing models (e.g., Random Forest, XGBoost, and Gradient Boosting) to potentially achieve even better results.
  7. Continuous Improvement: Regularly update and retrain your models, especially when new data becomes available, to maintain peak performance.

  8. Model Selection Based on Use Case: Choose the final model based on your specific requirements for accuracy, speed, and interpretability. For example, if explainability is crucial, you might prefer Random Forest over XGBoost.

Step 5.1 - Creation of ML Classifiers (Building Model 2) and to analyses the performance metrics using "Potential Accident level as Target varialbe, and also considering Accident level predicted from previous model as Input

Reading the NLP Preprocessed and Feature Enginering completed Dataset

In [ ]:
!ls '/content/drive/MyDrive/AIML_Capstone_Project'
'Data Set Industrial_safety_and_health_database_with_accidents_description.xlsx'
 df_preprocess.csv
 exported_data_NLP_Chatbot_Industry_Accident.xlsx
 Final_NLP_Glove_df.csv
 Final_NLP_Glove_df.xlsx
 Final_NLP_TFIDF_df.csv
 Final_NLP_TFIDF_df.xlsx
 Final_NLP_Word2Vec_df.csv
 Final_NLP_Word2Vec_df.xlsx
 glove.6B
 Intermediate_NLP_Glove_df.xlsx
 Intermediate_NLP_TFIDF_df.xlsx
 Intermediate_NLP_Word2Vec_df.xlsx
In [ ]:
import pandas as pd
Glove_df_Model2 = pd.read_excel('/content/drive/MyDrive/AIML_Capstone_Project/Intermediate_NLP_Glove_df.xlsx')
TFIDF_df_Model2 = pd.read_excel('/content/drive/MyDrive/AIML_Capstone_Project/Intermediate_NLP_TFIDF_df.xlsx')
Word2Vec_df_Model2 = pd.read_excel('/content/drive/MyDrive/AIML_Capstone_Project/Intermediate_NLP_Word2Vec_df.xlsx')
In [ ]:
Glove_df_Model2.head()
Out[ ]:
Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Day Weekday ... GloVe_290 GloVe_291 GloVe_292 GloVe_293 GloVe_294 GloVe_295 GloVe_296 GloVe_297 GloVe_298 GloVe_299
0 Country_01 Local_01 Mining 0 3 Male Contractor Pressed 1 Friday ... -0.027645 -0.119045 -0.061173 -0.065187 0.026949 0.197509 -0.013762 -0.348437 -0.066048 0.009923
1 Country_02 Local_02 Mining 0 3 Male Employee Pressurized Systems 2 Saturday ... -0.432424 -0.117516 0.034178 0.038456 0.132852 -0.166636 0.068733 -0.216856 -0.043625 -0.046566
2 Country_01 Local_03 Mining 0 2 Male Contractor (Remote) Manual Tools 6 Wednesday ... -0.006795 -0.161874 0.020432 0.085459 0.095127 0.220992 0.045661 -0.145386 0.004915 -0.032415
3 Country_01 Local_04 Mining 0 0 Male Contractor Others 8 Friday ... -0.048605 -0.088765 0.090351 -0.046184 -0.033896 0.236031 -0.110033 -0.125069 -0.052548 -0.041803
4 Country_01 Local_04 Mining 3 3 Male Contractor Others 10 Sunday ... 0.111791 -0.073450 0.056802 -0.105797 0.130160 0.158870 -0.042821 -0.077945 -0.038460 -0.072341

5 rows × 314 columns

In [ ]:
TFIDF_df_Model2.head()
Out[ ]:
Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Day Weekday ... yield yolk young zaf zamac zero zinc zinco zn zone
0 Country_01 Local_01 Mining 0 3 Male Contractor Pressed 1 Friday ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0
1 Country_02 Local_02 Mining 0 3 Male Employee Pressurized Systems 2 Saturday ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0
2 Country_01 Local_03 Mining 0 2 Male Contractor (Remote) Manual Tools 6 Wednesday ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0
3 Country_01 Local_04 Mining 0 0 Male Contractor Others 8 Friday ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0
4 Country_01 Local_04 Mining 3 3 Male Contractor Others 10 Sunday ... 0.0 0.0 0.0 0.209125 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 2372 columns

In [ ]:
Word2Vec_df_Model2.head()
Out[ ]:
Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Day Weekday ... Word2Vec_290 Word2Vec_291 Word2Vec_292 Word2Vec_293 Word2Vec_294 Word2Vec_295 Word2Vec_296 Word2Vec_297 Word2Vec_298 Word2Vec_299
0 Country_01 Local_01 Mining 0 3 Male Contractor Pressed 1 Friday ... 0.002379 0.015691 0.011600 0.001926 0.016089 0.015971 -0.000278 -0.012707 0.009473 -0.001360
1 Country_02 Local_02 Mining 0 3 Male Employee Pressurized Systems 2 Saturday ... 0.001062 0.005288 0.004659 0.000580 0.005845 0.006274 0.000318 -0.004185 0.003862 -0.001172
2 Country_01 Local_03 Mining 0 2 Male Contractor (Remote) Manual Tools 6 Wednesday ... 0.002426 0.015521 0.012403 0.001232 0.016147 0.016360 0.001063 -0.012123 0.009406 -0.002111
3 Country_01 Local_04 Mining 0 0 Male Contractor Others 8 Friday ... 0.001808 0.014007 0.010629 0.000948 0.013540 0.013591 0.000679 -0.011329 0.009131 -0.001737
4 Country_01 Local_04 Mining 3 3 Male Contractor Others 10 Sunday ... 0.001734 0.013645 0.010474 0.001372 0.013937 0.014240 0.001025 -0.010936 0.008495 -0.001456

5 rows × 314 columns

In [ ]:
# Function to train Random Forest and save predictions
def random_forest_predictions(df, dataset_name):
    X = df.drop('Accident Level', axis=1)
    y = df['Accident Level']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Initialize Random Forest
    rf_model = RandomForestClassifier()
    rf_model.fit(X_train, y_train)

    # Make predictions
    y_test_pred = rf_model.predict(X_test)

    # Create DataFrame for predictions
    predictions_df = pd.DataFrame({
        'Actual': y_test,
        'Predicted': y_test_pred
    })

    return predictions_df

# Generate predictions for each dataset
glove_rf_predictions = random_forest_predictions(Final_NLP_Glove_df, "GloVe")
tfidf_rf_predictions = random_forest_predictions(Final_NLP_TFIDF_df, "TF-IDF")
word2vec_rf_predictions = random_forest_predictions(Final_NLP_Word2Vec_df, "Word2Vec")

# Example: Display predictions for GloVe dataset
print("Random Forest Predictions for GloVe Dataset:")
print(glove_rf_predictions.head())
Random Forest Predictions for GloVe Dataset:
      Actual  Predicted
1495       4          4
543        1          1
1268       4          4
528        1          1
1094       3          3

Based on Model 2 Prediction , Predicted Accident level added to existing Dataframe

In [ ]:
# Merge based on index
Glove_df_Model2 = Glove_df_Model2.merge(glove_rf_predictions[['Predicted']], left_index=True, right_index=True)
In [ ]:
# Merge based on index
TFIDF_df_Model2 = TFIDF_df_Model2.merge(tfidf_rf_predictions[['Predicted']], left_index=True, right_index=True)
In [ ]:
# Merge based on index
Word2Vec_df_Model2 = Word2Vec_df_Model2.merge(word2vec_rf_predictions[['Predicted']], left_index=True, right_index=True)
In [ ]:
Glove_df_Model2.head()
Out[ ]:
Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Day Weekday ... GloVe_291 GloVe_292 GloVe_293 GloVe_294 GloVe_295 GloVe_296 GloVe_297 GloVe_298 GloVe_299 Predicted
15 Country_02 Local_05 Metals 0 3 Male Employee Liquid Metal 4 Thursday ... -0.024693 -0.047650 -0.015708 0.067314 0.167760 -0.035326 -0.099047 -0.047969 -0.027353 0
23 Country_02 Local_02 Mining 1 1 Male Contractor (Remote) Others 15 Monday ... -0.081535 0.111993 -0.104607 0.018857 0.360266 -0.132124 -0.324510 0.047853 0.064667 1
29 Country_02 Local_07 Mining 1 2 Male Employee Others 16 Tuesday ... 0.002789 0.018379 -0.021721 -0.018772 0.089487 -0.133801 -0.083973 -0.334744 0.253727 1
30 Country_01 Local_03 Mining 0 1 Male Employee Others 17 Wednesday ... 0.091715 -0.004494 0.073564 0.102722 0.159337 0.028924 -0.168550 0.099812 0.025263 0
32 Country_01 Local_01 Mining 2 3 Male Contractor Others 21 Sunday ... -0.120020 0.015738 -0.067694 0.145352 0.122889 -0.015679 -0.186358 0.062675 -0.020945 2

5 rows × 315 columns

In [ ]:
TFIDF_df_Model2.head()
Out[ ]:
Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Day Weekday ... yolk young zaf zamac zero zinc zinco zn zone Predicted
15 Country_02 Local_05 Metals 0 3 Male Employee Liquid Metal 4 Thursday ... 0.0 0.0 0.0 0.0 0.0 0.208879 0.0 0.0 0.0 0
23 Country_02 Local_02 Mining 1 1 Male Contractor (Remote) Others 15 Monday ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 1
29 Country_02 Local_07 Mining 1 2 Male Employee Others 16 Tuesday ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 1
30 Country_01 Local_03 Mining 0 1 Male Employee Others 17 Wednesday ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0
32 Country_01 Local_01 Mining 2 3 Male Contractor Others 21 Sunday ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 2

5 rows × 2373 columns

In [ ]:
Word2Vec_df_Model2.head()
Out[ ]:
Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Day Weekday ... Word2Vec_291 Word2Vec_292 Word2Vec_293 Word2Vec_294 Word2Vec_295 Word2Vec_296 Word2Vec_297 Word2Vec_298 Word2Vec_299 Predicted
15 Country_02 Local_05 Metals 0 3 Male Employee Liquid Metal 4 Thursday ... 0.011545 0.009870 0.001075 0.012843 0.012551 0.001392 -0.009150 0.007856 -0.002244 0
23 Country_02 Local_02 Mining 1 1 Male Contractor (Remote) Others 15 Monday ... 0.014734 0.011911 0.001155 0.015744 0.015248 0.001218 -0.012202 0.010318 -0.002047 1
29 Country_02 Local_07 Mining 1 2 Male Employee Others 16 Tuesday ... 0.012957 0.010844 0.001492 0.013304 0.014297 0.000404 -0.010245 0.009415 -0.000563 1
30 Country_01 Local_03 Mining 0 1 Male Employee Others 17 Wednesday ... 0.010988 0.009054 0.000557 0.012007 0.011500 0.001170 -0.008776 0.007065 -0.001432 0
32 Country_01 Local_01 Mining 2 3 Male Contractor Others 21 Sunday ... 0.013835 0.011202 0.000884 0.013832 0.014508 0.000181 -0.011331 0.008218 -0.001973 2

5 rows × 315 columns

Removing Accident Level from the Merged Dataset, since already we have Predicted Accident level

In [ ]:
# Columns to drop
columns_to_drop = ['Day', 'Accident Level', 'Description']

# Drop columns from each DataFrame
Glove_df_Model2 = Glove_df_Model2.drop(columns_to_drop, axis=1)
TFIDF_df_Model2 = TFIDF_df_Model2.drop(columns_to_drop, axis=1)
Word2Vec_df_Model2 = Word2Vec_df_Model2.drop(columns_to_drop, axis=1)
In [ ]:
Glove_df_Model2.head()
Out[ ]:
Country City Industry Sector Potential Accident Level Gender Employee type Critical Risk Weekday WeekofYear Weekend ... GloVe_291 GloVe_292 GloVe_293 GloVe_294 GloVe_295 GloVe_296 GloVe_297 GloVe_298 GloVe_299 Predicted
15 Country_02 Local_05 Metals 3 Male Employee Liquid Metal Thursday 5 0 ... -0.024693 -0.047650 -0.015708 0.067314 0.167760 -0.035326 -0.099047 -0.047969 -0.027353 0
23 Country_02 Local_02 Mining 1 Male Contractor (Remote) Others Monday 7 0 ... -0.081535 0.111993 -0.104607 0.018857 0.360266 -0.132124 -0.324510 0.047853 0.064667 1
29 Country_02 Local_07 Mining 2 Male Employee Others Tuesday 7 0 ... 0.002789 0.018379 -0.021721 -0.018772 0.089487 -0.133801 -0.083973 -0.334744 0.253727 1
30 Country_01 Local_03 Mining 1 Male Employee Others Wednesday 7 0 ... 0.091715 -0.004494 0.073564 0.102722 0.159337 0.028924 -0.168550 0.099812 0.025263 0
32 Country_01 Local_01 Mining 3 Male Contractor Others Sunday 7 1 ... -0.120020 0.015738 -0.067694 0.145352 0.122889 -0.015679 -0.186358 0.062675 -0.020945 2

5 rows × 312 columns

In [ ]:
# Calculate target variable distribution for each DataFrame
glove_target_dist = Glove_df_Model2['Potential Accident Level'].value_counts(normalize=False)
tfidf_target_dist = TFIDF_df_Model2['Potential Accident Level'].value_counts(normalize=False)
word2vec_target_dist = Word2Vec_df_Model2['Potential Accident Level'].value_counts(normalize=False)

# Create a DataFrame to display the distributions
target_distribution_df_Model2 = pd.DataFrame({
    'Glove': glove_target_dist,
    'TF-IDF': tfidf_target_dist,
    'Word2Vec': word2vec_target_dist
})

# Print the DataFrame
target_distribution_df_Model2
Out[ ]:
Glove TF-IDF Word2Vec
Potential Accident Level
3 30 30 30
1 19 19 19
2 15 15 15
4 10 10 10
0 7 7 7
In [ ]:
# Balance 'Potential Accident Level' using SMOTE. for all the 3 dataframes.
# Converting categorical features to numerical using one-hot encoding

import pandas as pd
from imblearn.over_sampling import SMOTE

# Function to balance data and one-hot encode categorical features
def balance_and_encode(df):
  # Separate features and target variable
  X = df.drop('Potential Accident Level', axis=1)
  y = df['Potential Accident Level']

  # One-hot encode categorical features (if any)
  categorical_features = X.select_dtypes(include=['object']).columns
  if categorical_features.any():
    X_encoded = pd.get_dummies(X, columns=categorical_features, dtype=int, drop_first=True)
  else:
    X_encoded = X

  # One-hot encode 'DayOfWeek'
  #X_encoded = pd.get_dummies(X_encoded, columns=['DayOfWeek'], dtype=int, drop_first=True)

  # Apply SMOTE to balance the dataset
  smote = SMOTE(random_state=42)
  X_resampled, y_resampled = smote.fit_resample(X_encoded, y)

  # Combine balanced features and target
  balanced_df_Model2 = pd.concat([X_resampled, y_resampled], axis=1)

  return balanced_df_Model2

# Apply the function to each DataFrame
Glove_df_Bal_Model2 = balance_and_encode(Glove_df_Model2)
TFIDF_df_Bal_Model2 = balance_and_encode(TFIDF_df_Model2)
Word2Vec_df_Bal_Model2 = balance_and_encode(Word2Vec_df_Model2)

# Calculate balanced target variable distribution for each DataFrame
glove_balanced_dist_Model2 = Glove_df_Bal_Model2['Potential Accident Level'].value_counts(normalize=False)
tfidf_balanced_dist_Model2 = TFIDF_df_Bal_Model2['Potential Accident Level'].value_counts(normalize=False)
word2vec_balanced_dist_Model2 = Word2Vec_df_Bal_Model2['Potential Accident Level'].value_counts(normalize=False)

# Create a DataFrame to display the balanced distributions
Balanced_Distribution_df_Model2 = pd.DataFrame({
    'Glove (Balanced)': glove_balanced_dist_Model2,
    'TF-IDF (Balanced)': tfidf_balanced_dist_Model2,
    'Word2Vec (Balanced)': word2vec_balanced_dist_Model2
})

# Print the DataFrame
Balanced_Distribution_df_Model2
Out[ ]:
Glove (Balanced) TF-IDF (Balanced) Word2Vec (Balanced)
Potential Accident Level
3 30 30 30
1 30 30 30
2 30 30 30
4 30 30 30
0 30 30 30
In [ ]:
#Rename the final dataframes as Final_NLP_Glove_df, Final_NLP_TFIDF_df & Final_NLP_Word2Vec

Model2_NLP_Glove_df = Glove_df_Bal_Model2.copy()
Model2_NLP_TFIDF_df = TFIDF_df_Bal_Model2.copy()
Model2_NLP_Word2Vec_df = Word2Vec_df_Bal_Model2.copy()
In [ ]:
 
In [ ]:
# Initialise all the known classifiers and  to run model on the 3 dataframes

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import time

# Initialize classifiers
classifiers = {
    "Logistic Regression": LogisticRegression(),
    "Support Vector Machine": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "XG Boost": XGBClassifier(),
    "Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier()
}

# Function to train and evaluate models
def train_and_evaluate(df):
    X = df.drop('Potential Accident Level', axis=1)
    y = df['Potential Accident Level']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    results = []
    for name, clf in classifiers.items():
        start_time = time.time()
        clf.fit(X_train, y_train)
        training_time = time.time() - start_time

        # Train metrics
        y_train_pred = clf.predict(X_train)
        train_accuracy = accuracy_score(y_train, y_train_pred)
        train_precision = precision_score(y_train, y_train_pred, average='weighted')
        train_recall = recall_score(y_train, y_train_pred, average='weighted')
        train_f1 = f1_score(y_train, y_train_pred, average='weighted')

        start_time = time.time()
        y_test_pred = clf.predict(X_test)
        prediction_time = time.time() - start_time

        # Test metrics
        test_accuracy = accuracy_score(y_test, y_test_pred)
        test_precision = precision_score(y_test, y_test_pred, average='weighted')
        test_recall = recall_score(y_test, y_test_pred, average='weighted')
        test_f1 = f1_score(y_test, y_test_pred, average='weighted')

        results.append([name,
                        train_accuracy, train_precision, train_recall, train_f1,
                        test_accuracy, test_precision, test_recall, test_f1,
                        training_time, prediction_time])

    return results

# Train and evaluate on each DataFrame
glove_results_Model2 = train_and_evaluate(Model2_NLP_Glove_df)
tfidf_results_Model2 = train_and_evaluate(Model2_NLP_TFIDF_df)
word2vec_results_Model2 = train_and_evaluate(Model2_NLP_Word2Vec_df)

# Create DataFrames for results
columns = ['Classifier',
           'Train Accuracy', 'Train Precision', 'Train Recall', 'Train F1-score',
           'Test Accuracy', 'Test Precision', 'Test Recall', 'Test F1-score',
           'Training Time', 'Prediction Time']

glove_df_Model2 = pd.DataFrame(glove_results_Model2, columns=columns)
tfidf_df_Model2 = pd.DataFrame(tfidf_results_Model2, columns=columns)
word2vec_df_Model2 = pd.DataFrame(word2vec_results_Model2, columns=columns)
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
In [ ]:
print("Classification matrix for Glove_Model2")
glove_df_Model2
Classification matrix for Glove_Model2
Out[ ]:
Classifier Train Accuracy Train Precision Train Recall Train F1-score Test Accuracy Test Precision Test Recall Test F1-score Training Time Prediction Time
0 Logistic Regression 0.983333 0.983631 0.983333 0.983312 0.800000 0.860000 0.800000 0.808057 0.055197 0.003088
1 Support Vector Machine 0.458333 0.371633 0.458333 0.391387 0.366667 0.398291 0.366667 0.352941 0.007070 0.004404
2 Decision Tree 1.000000 1.000000 1.000000 1.000000 0.633333 0.667513 0.633333 0.645425 0.022944 0.003422
3 Random Forest 1.000000 1.000000 1.000000 1.000000 0.733333 0.871667 0.733333 0.739009 0.200929 0.005950
4 Gradient Boosting 1.000000 1.000000 1.000000 1.000000 0.800000 0.865079 0.800000 0.808718 5.848943 0.004936
5 XG Boost 1.000000 1.000000 1.000000 1.000000 0.766667 0.817778 0.766667 0.772051 0.690330 0.063910
6 Naive Bayes 0.891667 0.921791 0.891667 0.892587 0.633333 0.730556 0.633333 0.651717 0.003937 0.002742
7 K-Nearest Neighbors 0.733333 0.736167 0.733333 0.714334 0.666667 0.642646 0.666667 0.638497 0.002758 0.004863
In [ ]:
print("Classification matrix for TFIDF_Model2")
tfidf_df_Model2
Classification matrix for TFIDF_Model2
Out[ ]:
Classifier Train Accuracy Train Precision Train Recall Train F1-score Test Accuracy Test Precision Test Recall Test F1-score Training Time Prediction Time
0 Logistic Regression 0.983333 0.983631 0.983333 0.983312 0.733333 0.720000 0.733333 0.714286 0.922195 0.022626
1 Support Vector Machine 0.433333 0.365247 0.433333 0.372773 0.333333 0.388889 0.333333 0.338235 0.042314 0.018541
2 Decision Tree 1.000000 1.000000 1.000000 1.000000 0.600000 0.727937 0.600000 0.628230 0.021047 0.019979
3 Random Forest 1.000000 1.000000 1.000000 1.000000 0.733333 0.891905 0.733333 0.745413 0.262207 0.025491
4 Gradient Boosting 1.000000 1.000000 1.000000 1.000000 0.700000 0.797778 0.700000 0.719841 3.073522 0.022744
5 XG Boost 1.000000 1.000000 1.000000 1.000000 0.766667 0.860606 0.766667 0.754932 2.039161 0.393569
6 Naive Bayes 1.000000 1.000000 1.000000 1.000000 0.600000 0.868095 0.600000 0.640878 0.014904 0.012821
7 K-Nearest Neighbors 0.666667 0.692660 0.666667 0.647794 0.666667 0.614444 0.666667 0.598631 0.011832 0.014342
In [ ]:
print("Classification matrix for Word2Vec_Model2")
word2vec_df_Model2
Classification matrix for Word2Vec_Model2
Out[ ]:
Classifier Train Accuracy Train Precision Train Recall Train F1-score Test Accuracy Test Precision Test Recall Test F1-score Training Time Prediction Time
0 Logistic Regression 0.850000 0.852920 0.850000 0.850747 0.633333 0.658519 0.633333 0.638796 0.043646 0.002485
1 Support Vector Machine 0.433333 0.352968 0.433333 0.370639 0.300000 0.359259 0.300000 0.306863 0.005997 0.003526
2 Decision Tree 1.000000 1.000000 1.000000 1.000000 0.633333 0.658333 0.633333 0.616190 0.019104 0.004034
3 Random Forest 1.000000 1.000000 1.000000 1.000000 0.600000 0.617778 0.600000 0.598519 0.213861 0.007555
4 Gradient Boosting 1.000000 1.000000 1.000000 1.000000 0.600000 0.638571 0.600000 0.572991 5.660460 0.003985
5 XG Boost 1.000000 1.000000 1.000000 1.000000 0.700000 0.702143 0.700000 0.693386 0.816368 0.064725
6 Naive Bayes 0.625000 0.622754 0.625000 0.615680 0.400000 0.475000 0.400000 0.426152 0.003691 0.002718
7 K-Nearest Neighbors 0.683333 0.696668 0.683333 0.667066 0.566667 0.545926 0.566667 0.526602 0.002833 0.004601
In [ ]:
# Plotting the classification report for all the ML classifers with training and prediction time comparisions for Model2.

import time
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Function to plot classification report and training/prediction times
def plot_results(df, title):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

    # Classification report heatmap
    report_data = df[['Classifier', 'Train Precision', 'Train Recall', 'Train F1-score',
                       'Test Precision', 'Test Recall', 'Test F1-score']].set_index('Classifier')
    sns.heatmap(report_data, annot=True, cmap='Oranges', fmt='.2f', ax=ax1)
    ax1.set_title(f'Classifier Performance - {title}')

    # Training and prediction time comparison
    df.plot(x='Classifier', y=['Training Time', 'Prediction Time'], kind='bar', ax=ax2, cmap='Set3')
    ax2.set_title(f'Training and Prediction Time Model2 - {title}')
    ax2.set_ylabel('Time (seconds)')
    plt.tight_layout()
    plt.show()

# Plot results for each DataFrame
plot_results(glove_df_Model2, 'Glove Embeddings')
plot_results(tfidf_df_Model2, 'TF-IDF Embeddings')
plot_results(word2vec_df_Model2, 'Word2Vec Embeddings')
In [ ]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

def plot_train_test_confusion_matrices(df, df_name):
    X = df.drop('Potential Accident Level', axis=1)
    y = df['Potential Accident Level']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    fig, axes = plt.subplots(8, 2, figsize=(20, 40))
    fig.suptitle(f'Train and Test Confusion Matrices for {df_name}', fontsize=15, y=0.98)

    for i, (name, clf) in enumerate(classifiers.items()):
        clf.fit(X_train, y_train)

        # Train confusion matrix
        y_train_pred = clf.predict(X_train)
        cm_train = confusion_matrix(y_train, y_train_pred)
        disp_train = ConfusionMatrixDisplay(confusion_matrix=cm_train, display_labels=clf.classes_)
        disp_train.plot(ax=axes[i, 0], cmap='Oranges')
        axes[i, 0].set_title(f'{name} (Train)', fontsize=12)

        # Test confusion matrix
        y_test_pred = clf.predict(X_test)
        cm_test = confusion_matrix(y_test, y_test_pred)
        disp_test = ConfusionMatrixDisplay(confusion_matrix=cm_test, display_labels=clf.classes_)
        disp_test.plot(ax=axes[i, 1], cmap='Oranges')
        axes[i, 1].set_title(f'{name} (Test)', fontsize=12)

    plt.tight_layout(rect=[0, 0, 1, 0.96])
    plt.show()
In [ ]:
plot_train_test_confusion_matrices(Model2_NLP_Glove_df, 'Glove Embeddings-Model2')
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
In [ ]:
plot_train_test_confusion_matrices(Model2_NLP_TFIDF_df, 'TFIDF-Model2')
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
In [ ]:
plot_train_test_confusion_matrices(Model2_NLP_Word2Vec_df, 'Word2Vec-Model2')
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Milestone 2:¶

Input:¶

Preprocessed output from Milestone-1¶

Based on the findings from Milestone 1, it is recommended to use the Glove embeddings dataset for further processing and model development using deep learning techniques.

Design, train and test Neural networks classifiers¶


In [ ]:
# Installing the required modules
!pip install numpy
!pip install --upgrade tensorflow
!pip install --upgrade keras
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (1.26.4)
Requirement already satisfied: tensorflow in /usr/local/lib/python3.10/dist-packages (2.17.1)
Collecting tensorflow
  Downloading tensorflow-2.18.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Requirement already satisfied: absl-py>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (1.4.0)
Requirement already satisfied: astunparse>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (1.6.3)
Requirement already satisfied: flatbuffers>=24.3.25 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (24.3.25)
Requirement already satisfied: gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (0.6.0)
Requirement already satisfied: google-pasta>=0.1.1 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (0.2.0)
Requirement already satisfied: libclang>=13.0.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (18.1.1)
Requirement already satisfied: opt-einsum>=2.3.2 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (3.4.0)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from tensorflow) (24.2)
Requirement already satisfied: protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0dev,>=3.20.3 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (4.25.5)
Requirement already satisfied: requests<3,>=2.21.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (2.32.3)
Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from tensorflow) (75.1.0)
Requirement already satisfied: six>=1.12.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (1.16.0)
Requirement already satisfied: termcolor>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (2.5.0)
Requirement already satisfied: typing-extensions>=3.6.6 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (4.12.2)
Requirement already satisfied: wrapt>=1.11.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (1.16.0)
Requirement already satisfied: grpcio<2.0,>=1.24.3 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (1.68.0)
Collecting tensorboard<2.19,>=2.18 (from tensorflow)
  Downloading tensorboard-2.18.0-py3-none-any.whl.metadata (1.6 kB)
Requirement already satisfied: keras>=3.5.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (3.5.0)
Requirement already satisfied: numpy<2.1.0,>=1.26.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (1.26.4)
Requirement already satisfied: h5py>=3.11.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (3.12.1)
Requirement already satisfied: ml-dtypes<0.5.0,>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (0.4.1)
Requirement already satisfied: tensorflow-io-gcs-filesystem>=0.23.1 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (0.37.1)
Requirement already satisfied: wheel<1.0,>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from astunparse>=1.6.0->tensorflow) (0.45.0)
Requirement already satisfied: rich in /usr/local/lib/python3.10/dist-packages (from keras>=3.5.0->tensorflow) (13.9.4)
Requirement already satisfied: namex in /usr/local/lib/python3.10/dist-packages (from keras>=3.5.0->tensorflow) (0.0.8)
Requirement already satisfied: optree in /usr/local/lib/python3.10/dist-packages (from keras>=3.5.0->tensorflow) (0.13.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.21.0->tensorflow) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.21.0->tensorflow) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.21.0->tensorflow) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.21.0->tensorflow) (2024.8.30)
Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.10/dist-packages (from tensorboard<2.19,>=2.18->tensorflow) (3.7)
Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in /usr/local/lib/python3.10/dist-packages (from tensorboard<2.19,>=2.18->tensorflow) (0.7.2)
Requirement already satisfied: werkzeug>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from tensorboard<2.19,>=2.18->tensorflow) (3.1.3)
Requirement already satisfied: MarkupSafe>=2.1.1 in /usr/local/lib/python3.10/dist-packages (from werkzeug>=1.0.1->tensorboard<2.19,>=2.18->tensorflow) (3.0.2)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich->keras>=3.5.0->tensorflow) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich->keras>=3.5.0->tensorflow) (2.18.0)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich->keras>=3.5.0->tensorflow) (0.1.2)
Downloading tensorflow-2.18.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (615.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 615.3/615.3 MB 1.2 MB/s eta 0:00:00
Downloading tensorboard-2.18.0-py3-none-any.whl (5.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.5/5.5 MB 95.0 MB/s eta 0:00:00
Installing collected packages: tensorboard, tensorflow
  Attempting uninstall: tensorboard
    Found existing installation: tensorboard 2.17.1
    Uninstalling tensorboard-2.17.1:
      Successfully uninstalled tensorboard-2.17.1
  Attempting uninstall: tensorflow
    Found existing installation: tensorflow 2.17.1
    Uninstalling tensorflow-2.17.1:
      Successfully uninstalled tensorflow-2.17.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tf-keras 2.17.0 requires tensorflow<2.18,>=2.17, but you have tensorflow 2.18.0 which is incompatible.
Successfully installed tensorboard-2.18.0 tensorflow-2.18.0
Requirement already satisfied: keras in /usr/local/lib/python3.10/dist-packages (3.5.0)
Collecting keras
  Downloading keras-3.7.0-py3-none-any.whl.metadata (5.8 kB)
Requirement already satisfied: absl-py in /usr/local/lib/python3.10/dist-packages (from keras) (1.4.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from keras) (1.26.4)
Requirement already satisfied: rich in /usr/local/lib/python3.10/dist-packages (from keras) (13.9.4)
Requirement already satisfied: namex in /usr/local/lib/python3.10/dist-packages (from keras) (0.0.8)
Requirement already satisfied: h5py in /usr/local/lib/python3.10/dist-packages (from keras) (3.12.1)
Requirement already satisfied: optree in /usr/local/lib/python3.10/dist-packages (from keras) (0.13.1)
Requirement already satisfied: ml-dtypes in /usr/local/lib/python3.10/dist-packages (from keras) (0.4.1)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from keras) (24.2)
Requirement already satisfied: typing-extensions>=4.5.0 in /usr/local/lib/python3.10/dist-packages (from optree->keras) (4.12.2)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich->keras) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich->keras) (2.18.0)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich->keras) (0.1.2)
Downloading keras-3.7.0-py3-none-any.whl (1.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 15.2 MB/s eta 0:00:00
Installing collected packages: keras
  Attempting uninstall: keras
    Found existing installation: keras 3.5.0
    Uninstalling keras-3.5.0:
      Successfully uninstalled keras-3.5.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tf-keras 2.17.0 requires tensorflow<2.18,>=2.17, but you have tensorflow 2.18.0 which is incompatible.
Successfully installed keras-3.7.0
In [ ]:
import tensorflow as tf
import keras
import numpy as np
import pandas as pd

print("TensorFlow version:", tf.__version__)
print("Keras version:", keras.__version__)
print("NumPy version:", np.__version__)
TensorFlow version: 2.17.1
Keras version: 3.5.0
NumPy version: 1.26.4
In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
file_path = '/content/drive/MyDrive/AIML_Capstone_Project/Final_NLP_Glove_df.csv'
# Read the csv file using pandas
ISH_NLP_Glove_df = pd.read_csv(file_path)

# Display the first few rows of the dataframe
ISH_NLP_Glove_df.head()
Out[ ]:
WeekofYear Weekend GloVe_0 GloVe_1 GloVe_2 GloVe_3 GloVe_4 GloVe_5 GloVe_6 GloVe_7 ... Weekday_Monday Weekday_Saturday Weekday_Sunday Weekday_Thursday Weekday_Tuesday Weekday_Wednesday Season_Spring Season_Summer Season_Winter Accident Level
0 53 0 0.078223 0.040773 -0.041107 -0.293287 -0.148195 -0.085006 0.120392 -0.043692 ... 0 0 0 0 0 0 0 1 0 0
1 53 1 -0.047137 0.109611 -0.049147 -0.199018 0.049427 -0.139335 0.039627 -0.095639 ... 0 1 0 0 0 0 0 1 0 0
2 1 0 -0.057290 0.202640 -0.209550 -0.169683 -0.027187 -0.091942 -0.168629 -0.005628 ... 0 0 0 0 0 1 0 1 0 0
3 1 0 -0.033755 0.019709 -0.029097 -0.216930 -0.088179 -0.137728 -0.017687 0.012178 ... 0 0 0 0 0 0 0 1 0 0
4 1 1 -0.099598 0.082313 -0.132139 -0.090341 -0.122124 -0.055800 0.132037 0.086205 ... 0 0 1 0 0 0 0 1 0 3

5 rows × 362 columns

In [ ]:
# Creating a copy of the dataframe ISH_NLP_Glove_df
ISH_NLP_Glove_df_main = ISH_NLP_Glove_df.copy()

# Display the first few rows of the new dataframe
ISH_NLP_Glove_df_main.head()
Out[ ]:
WeekofYear Weekend GloVe_0 GloVe_1 GloVe_2 GloVe_3 GloVe_4 GloVe_5 GloVe_6 GloVe_7 ... Weekday_Monday Weekday_Saturday Weekday_Sunday Weekday_Thursday Weekday_Tuesday Weekday_Wednesday Season_Spring Season_Summer Season_Winter Accident Level
0 53 0 0.078223 0.040773 -0.041107 -0.293287 -0.148195 -0.085006 0.120392 -0.043692 ... 0 0 0 0 0 0 0 1 0 0
1 53 1 -0.047137 0.109611 -0.049147 -0.199018 0.049427 -0.139335 0.039627 -0.095639 ... 0 1 0 0 0 0 0 1 0 0
2 1 0 -0.057290 0.202640 -0.209550 -0.169683 -0.027187 -0.091942 -0.168629 -0.005628 ... 0 0 0 0 0 1 0 1 0 0
3 1 0 -0.033755 0.019709 -0.029097 -0.216930 -0.088179 -0.137728 -0.017687 0.012178 ... 0 0 0 0 0 0 0 1 0 0
4 1 1 -0.099598 0.082313 -0.132139 -0.090341 -0.122124 -0.055800 0.132037 0.086205 ... 0 0 1 0 0 0 0 1 0 3

5 rows × 362 columns

In [ ]:
# Saving ISH_NLP_Glove_df_main as csv and xlsx

from google.colab import drive
drive.mount('/content/drive')

# Corrected file path
ISH_NLP_Glove_df_main.to_csv('/content/drive/MyDrive/AIML_Capstone_Project/Final_NLP_Glove_df.csv', index=False)
ISH_NLP_Glove_df_main.to_excel('/content/drive/MyDrive/AIML_Capstone_Project/Final_NLP_Glove_df.xlsx', index=False)
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [ ]:
# Display summary statistics
print("\
Summary statistics:")
ISH_NLP_Glove_df_main.describe().T
Summary statistics:
Out[ ]:
count mean std min 25% 50% 75% max
WeekofYear 1545.0 19.589644 13.347339 1.000000 8.000000 17.000000 27.000000 53.000000
Weekend 1545.0 0.136570 0.343504 0.000000 0.000000 0.000000 0.000000 1.000000
GloVe_0 1545.0 -0.031031 0.062240 -0.317722 -0.059310 -0.023450 0.006005 0.186513
GloVe_1 1545.0 0.073986 0.070001 -0.156011 0.032227 0.074974 0.118921 0.322451
GloVe_2 1545.0 -0.074833 0.061172 -0.316431 -0.111710 -0.073061 -0.035238 0.242731
... ... ... ... ... ... ... ... ...
Weekday_Wednesday 1545.0 0.107443 0.309776 0.000000 0.000000 0.000000 0.000000 1.000000
Season_Spring 1545.0 0.113269 0.317023 0.000000 0.000000 0.000000 0.000000 1.000000
Season_Summer 1545.0 0.220065 0.414424 0.000000 0.000000 0.000000 0.000000 1.000000
Season_Winter 1545.0 0.177994 0.382631 0.000000 0.000000 0.000000 0.000000 1.000000
Accident Level 1545.0 2.000000 1.414671 0.000000 1.000000 2.000000 3.000000 4.000000

362 rows × 8 columns

Data Preprocessing:¶

  1. Feature Scaling: Neural networks perform better when input features are on a similar scale. The GloVe features are likely to have different ranges, so scaling them will ensure that all features contribute equally to the model's learning process. Preprocessing step: Apply standardization (mean=0, std=1) or normalization (min-max scaling) to all GloVe features.
  2. Encode the Target Variable: The "Accident Level" column is of type int64, which suggests it's a categorical variable representing different levels of accidents. Preprocessing step: Depending on the number of unique accident levels, applying one-hot encoding or use it as is if it's in a numeric format.
  3. Split the Dataset: To properly evaluate the model's performance, we need separate training and testing sets. Preprocessing step: Splitting the data into training and testing sets (e.g., 80% training, 20% testing).
  4. Dimensionality Reduction (Optional): With 300 GloVe features, there might be some redundancy or noise in the data. Reducing dimensionality could improve model performance and reduce computational costs. Preprocessing step: Applying techniques like Principal Component Analysis (PCA) reduces the number of features while retaining most of the variance in the data.
  5. Data Augmentation (Optional): If the dataset is relatively small (1545 samples might be considered small for some complex neural network architectures), data augmentation techniques could help increase the effective size of the training set. Preprocessing step: Explore domain-specific data augmentation techniques if applicable.

In [ ]:
# Preparing data to be fed into a Neural Network Classifier

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from tensorflow.keras.utils import to_categorical

# Separate features (Glove embeddings) and target variable
X = ISH_NLP_Glove_df_main.drop('Accident Level', axis=1)
y = ISH_NLP_Glove_df_main['Accident Level']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Encode the target variable
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Convert target variable to one-hot encoding
y_train_onehot = to_categorical(y_train_encoded)
y_test_onehot = to_categorical(y_test_encoded)
In [ ]:
# Print the shapes of the resulting datasets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
print ("\n")
print("Shape of X_train_scaled:", X_train_scaled.shape)
print("Shape of X_test_scaled:", X_test_scaled.shape)
print("Shape of y_train_onehot:", y_train_onehot.shape)
print("Shape of y_test_onehot:", y_test_onehot.shape)
Shape of X_train: (1236, 361)
Shape of X_test: (309, 361)
Shape of y_train: (1236,)
Shape of y_test: (309,)


Shape of X_train_scaled: (1236, 361)
Shape of X_test_scaled: (309, 361)
Shape of y_train_onehot: (1236, 5)
Shape of y_test_onehot: (309, 5)

Base NN Classifier

In [ ]:
# Import necessary libraries for building the neural network
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import SGD, Adam, RMSprop, Adagrad, Adadelta, Adamax, Nadam, AdamW
from tensorflow.keras.utils import to_categorical

# Function to build the model
def build_base_nn_model(input_shape, num_classes, optimizer_name):
    # Define the model architecture
    base_nn_model = Sequential([
        Input(shape=(input_shape,)),
        Dense(128, activation='relu'),
        Dense(64, activation='relu'),
        Dense(num_classes, activation='softmax')
    ])

    # Optimizers dictionary
    optimizers = {
        'SGD': SGD(),
        'RMSprop': RMSprop(),
        'Adam': Adam(),
        'Nadam': Nadam(),
        'AdamW': AdamW()
    }

    # Validate optimizer name
    if optimizer_name not in optimizers:
        raise ValueError("Optimizer " + optimizer_name + " is not recognized. Please choose from " + str(list(optimizers.keys())))

    # Compile the model
    base_nn_model.compile(optimizer=optimizers[optimizer_name], loss='categorical_crossentropy', metrics=['accuracy'])
    return base_nn_model

# Define number of classes and input shape
num_classes = y_train_onehot.shape[1]
input_shape = X_train_scaled.shape[1]  # GloVe embeddings

# Initialize models with different optimizers
base_nn_models = {}
optimizers = ['SGD', 'RMSprop', 'Adam', 'Nadam', 'AdamW']
for opt in optimizers:
    base_nn_models[opt] = build_base_nn_model(input_shape, num_classes, optimizer_name=opt)

print("Base NN Models initialized with different optimizers.")
Base NN Models initialized with different optimizers.
In [ ]:
# Print model summaries for all optimizers
for opt, base_nn_model in base_nn_models.items():
    print(f"Model with {opt} optimizer:")
    base_nn_model.summary()
Model with SGD optimizer:
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ dense (Dense)                        │ (None, 128)                 │          46,336 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_1 (Dense)                      │ (None, 64)                  │           8,256 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_2 (Dense)                      │ (None, 5)                   │             325 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 54,917 (214.52 KB)
 Trainable params: 54,917 (214.52 KB)
 Non-trainable params: 0 (0.00 B)
Model with RMSprop optimizer:
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ dense_3 (Dense)                      │ (None, 128)                 │          46,336 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_4 (Dense)                      │ (None, 64)                  │           8,256 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_5 (Dense)                      │ (None, 5)                   │             325 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 54,917 (214.52 KB)
 Trainable params: 54,917 (214.52 KB)
 Non-trainable params: 0 (0.00 B)
Model with Adam optimizer:
Model: "sequential_2"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ dense_6 (Dense)                      │ (None, 128)                 │          46,336 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_7 (Dense)                      │ (None, 64)                  │           8,256 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_8 (Dense)                      │ (None, 5)                   │             325 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 54,917 (214.52 KB)
 Trainable params: 54,917 (214.52 KB)
 Non-trainable params: 0 (0.00 B)
Model with Nadam optimizer:
Model: "sequential_3"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ dense_9 (Dense)                      │ (None, 128)                 │          46,336 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_10 (Dense)                     │ (None, 64)                  │           8,256 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_11 (Dense)                     │ (None, 5)                   │             325 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 54,917 (214.52 KB)
 Trainable params: 54,917 (214.52 KB)
 Non-trainable params: 0 (0.00 B)
Model with AdamW optimizer:
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ dense_12 (Dense)                     │ (None, 128)                 │          46,336 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_13 (Dense)                     │ (None, 64)                  │           8,256 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_14 (Dense)                     │ (None, 5)                   │             325 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 54,917 (214.52 KB)
 Trainable params: 54,917 (214.52 KB)
 Non-trainable params: 0 (0.00 B)
In [ ]:
# Train and evaluate the models
base_nn_model_history = {}
for opt, base_nn_model in base_nn_models.items():
    print(f"Training model with {opt} optimizer...")
    base_nn_model_history[opt] = base_nn_model.fit(X_train_scaled, y_train_onehot, epochs=50, batch_size=32, validation_split=0.2, verbose=0)
    loss, accuracy = base_nn_model.evaluate(X_test_scaled, y_test_onehot, verbose=0)
    print(f"Test Loss ({opt}): {loss:.4f}")
    print(f"Test Accuracy ({opt}): {accuracy:.4f}")

print("Training and evaluation for Base NN complete.")
Training model with SGD optimizer...
Test Loss (SGD): 0.0907
Test Accuracy (SGD): 0.9644
Training model with RMSprop optimizer...
Test Loss (RMSprop): 0.1713
Test Accuracy (RMSprop): 0.9709
Training model with Adam optimizer...
Test Loss (Adam): 0.1162
Test Accuracy (Adam): 0.9612
Training model with Nadam optimizer...
Test Loss (Nadam): 0.1267
Test Accuracy (Nadam): 0.9612
Training model with AdamW optimizer...
Test Loss (AdamW): 0.1164
Test Accuracy (AdamW): 0.9547
Training and evaluation for Base NN complete.

Train vs Validation plots for Accuracy and Loss for Base NN Classifier

In [ ]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(len(optimizers), 2, figsize=(15, 5 * len(optimizers)))

for i, opt in enumerate(optimizers):
    # Accuracy plot
    axes[i, 0].plot(base_nn_model_history[opt].history['accuracy'], label='Train Accuracy', color='blue')
    axes[i, 0].plot(base_nn_model_history[opt].history['val_accuracy'], label='Validation Accuracy', color='green')
    axes[i, 0].set_title(f'Train vs Validation Accuracy ({opt})')
    axes[i, 0].set_xlabel('Epoch')
    axes[i, 0].set_ylabel('Accuracy')
    axes[i, 0].legend()

    # Loss plot
    axes[i, 1].plot(base_nn_model_history[opt].history['loss'], label='Train Loss', color='red')
    axes[i, 1].plot(base_nn_model_history[opt].history['val_loss'], label='Validation Loss', color='orange')
    axes[i, 1].set_title(f'Train vs Validation Loss ({opt})')
    axes[i, 1].set_xlabel('Epoch')
    axes[i, 1].set_ylabel('Loss')
    axes[i, 1].legend()

plt.tight_layout()
plt.show()

Observations:¶

Accuracy Across Optimizers:

  • The RMSprop optimizer achieved the highest test accuracy at 97.09%.
  • The SGD optimizer was slightly behind, with a test accuracy of 96.44%.
  • The Adam and Nadam optimizers had the same test accuracy of 96.12%.
  • The AdamW optimizer showed the lowest test accuracy among the tested optimizers at 95.47%.

Loss Across Optimizers:

  • The SGD optimizer had the lowest test loss at 0.0907, indicating better generalization performance in terms of minimizing errors.
  • The Adam optimizer had a similar test loss to AdamW, but slightly better accuracy.
  • The RMSprop optimizer had a relatively higher test loss (0.1713) despite achieving the highest accuracy.

Consistency:

  • Optimizers like Adam and Nadam performed similarly in both accuracy and loss, suggesting consistent results across these two variants.

Base NN Training:

  • The results indicate that the network's performance varies depending on the optimizer, highlighting the importance of optimizer selection in model training.

Insights¶

  1. RMSprop's Higher Accuracy Despite Loss: The RMSprop optimizer achieved the highest accuracy but with a higher loss compared to SGD. This may indicate that RMSprop focuses on improving classification accuracy but does not minimize the error as effectively as SGD.

  2. SGD's Generalization Capability: The SGD optimizer showed the lowest test loss, suggesting it may generalize better in this setup, though its accuracy is slightly lower than RMSprop.

  3. Trade-offs in Optimizer Selection: While RMSprop delivered the highest accuracy, its higher loss could be a concern depending on the application's sensitivity to errors. SGD may be a better choice if minimizing test loss is a priority.

  4. Adam and Nadam Optimizer Similarity: The similarity in results between Adam and Nadam indicates that adding the Nesterov momentum to Adam (Nadam) did not provide significant improvement in this case.

  5. AdamW Performance: Despite being a variant of Adam with weight decay for better regularization, AdamW underperformed compared to other optimizers in both accuracy and loss, suggesting it might not be ideal for this model's architecture or data.

Classification Reports for Base NN Classifier.

In [ ]:
from sklearn.metrics import classification_report

# Predict on train and test data for each optimizer
y_pred_train = {}
y_pred_test = {}
for opt, base_nn_model in base_nn_models.items():
    y_pred_train[opt] = np.argmax(base_nn_model.predict(X_train_scaled), axis=1)
    y_pred_test[opt] = np.argmax(base_nn_model.predict(X_test_scaled), axis=1)

# Generate classification reports
for opt in optimizers:
    print(f"\nClassification Report for {opt} optimizer:")
    train_report = classification_report(y_train_encoded, y_pred_train[opt], output_dict=True)
    test_report = classification_report(y_test_encoded, y_pred_test[opt], output_dict=True)

    # Create DataFrames for better visualization
    train_df = pd.DataFrame(train_report).transpose()
    test_df = pd.DataFrame(test_report).transpose()

    # Rename columns
    train_df.columns = ['Train_' + col for col in train_df.columns]
    test_df.columns = ['Test_' + col for col in test_df.columns]

    # Concatenate DataFrames
    combined_df = pd.concat([train_df, test_df], axis=1)

    # Display the combined report
    display(combined_df)
    print("\n" * 3)
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step

Classification Report for SGD optimizer:
Train_precision Train_recall Train_f1-score Train_support Test_precision Test_recall Test_f1-score Test_support
0 0.992278 0.980916 0.986564 262.000000 0.877551 0.914894 0.895833 47.000000
1 0.987395 0.995763 0.991561 236.000000 0.972222 0.958904 0.965517 73.000000
2 0.996047 0.996047 0.996047 253.000000 0.964286 0.964286 0.964286 56.000000
3 0.987705 0.991770 0.989733 243.000000 0.984615 0.969697 0.977099 66.000000
4 1.000000 1.000000 1.000000 242.000000 1.000000 1.000000 1.000000 67.000000
accuracy 0.992718 0.992718 0.992718 0.992718 0.964401 0.964401 0.964401 0.964401
macro avg 0.992685 0.992899 0.992781 1236.000000 0.959735 0.961556 0.960547 309.000000
weighted avg 0.992730 0.992718 0.992713 1236.000000 0.965054 0.964401 0.964646 309.000000




Classification Report for RMSprop optimizer:
Train_precision Train_recall Train_f1-score Train_support Test_precision Test_recall Test_f1-score Test_support
0 0.984733 0.984733 0.984733 262.000000 0.933333 0.893617 0.913043 47.000000
1 0.987395 0.995763 0.991561 236.000000 0.972973 0.986301 0.979592 73.000000
2 0.996047 0.996047 0.996047 253.000000 0.964912 0.982143 0.973451 56.000000
3 0.995851 0.987654 0.991736 243.000000 0.984615 0.969697 0.977099 66.000000
4 1.000000 1.000000 1.000000 242.000000 0.985294 1.000000 0.992593 67.000000
accuracy 0.992718 0.992718 0.992718 0.992718 0.970874 0.970874 0.970874 0.970874
macro avg 0.992805 0.992839 0.992815 1236.000000 0.968226 0.966352 0.967156 309.000000
weighted avg 0.992732 0.992718 0.992719 1236.000000 0.970641 0.970874 0.970643 309.000000




Classification Report for Adam optimizer:
Train_precision Train_recall Train_f1-score Train_support Test_precision Test_recall Test_f1-score Test_support
0 0.984674 0.980916 0.982792 262.000000 0.875000 0.893617 0.884211 47.000000
1 0.991525 0.991525 0.991525 236.000000 0.972603 0.972603 0.972603 73.000000
2 1.000000 0.996047 0.998020 253.000000 0.964912 0.982143 0.973451 56.000000
3 0.987755 0.995885 0.991803 243.000000 0.968750 0.939394 0.953846 66.000000
4 1.000000 1.000000 1.000000 242.000000 1.000000 1.000000 1.000000 67.000000
accuracy 0.992718 0.992718 0.992718 0.992718 0.961165 0.961165 0.961165 0.961165
macro avg 0.992791 0.992875 0.992828 1236.000000 0.956253 0.957551 0.956822 309.000000
weighted avg 0.992726 0.992718 0.992717 1236.000000 0.961481 0.961165 0.961246 309.000000




Classification Report for Nadam optimizer:
Train_precision Train_recall Train_f1-score Train_support Test_precision Test_recall Test_f1-score Test_support
0 0.980916 0.980916 0.980916 262.0000 0.875000 0.893617 0.884211 47.000000
1 0.991489 0.987288 0.989384 236.0000 0.986111 0.972603 0.979310 73.000000
2 0.996047 0.996047 0.996047 253.0000 0.982143 0.982143 0.982143 56.000000
3 0.991770 0.991770 0.991770 243.0000 0.953846 0.939394 0.946565 66.000000
4 0.995885 1.000000 0.997938 242.0000 0.985294 1.000000 0.992593 67.000000
accuracy 0.991100 0.991100 0.991100 0.9911 0.961165 0.961165 0.961165 0.961165
macro avg 0.991221 0.991204 0.991211 1236.0000 0.956479 0.957551 0.956964 309.000000
weighted avg 0.991097 0.991100 0.991097 1236.0000 0.961423 0.961165 0.961244 309.000000




Classification Report for AdamW optimizer:
Train_precision Train_recall Train_f1-score Train_support Test_precision Test_recall Test_f1-score Test_support
0 0.992278 0.980916 0.986564 262.000000 0.886364 0.829787 0.857143 47.000000
1 0.987448 1.000000 0.993684 236.000000 0.972603 0.972603 0.972603 73.000000
2 0.996047 0.996047 0.996047 253.000000 0.964912 0.982143 0.973451 56.000000
3 0.991770 0.991770 0.991770 243.000000 0.940299 0.954545 0.947368 66.000000
4 1.000000 1.000000 1.000000 242.000000 0.985294 1.000000 0.992593 67.000000
accuracy 0.993528 0.993528 0.993528 0.993528 0.954693 0.954693 0.954693 0.954693
macro avg 0.993509 0.993747 0.993613 1236.000000 0.949894 0.947816 0.948632 309.000000
weighted avg 0.993539 0.993528 0.993519 1236.000000 0.953944 0.954693 0.954139 309.000000



Observations:¶

  1. High Training Metrics Across Optimizers: All optimizers demonstrate high performance on the training data across all classes (0-4), with precision, recall, and F1-scores frequently at or near 100%. This suggests that the model fits the training data extremely well.

  2. Performance on Test Data: The test data metrics are generally lower than the training data, which is expected due to generalization challenges, but still are quite high, showing that the models generalize well though not perfectly.

Specific Insights:¶

  1. SGD:
  • Performance: High training and test performance across all classes with particularly strong results in class 4.

  • Test F1-Scores: These are slightly lower compared to training scores, particularly in class 0 and 1 where there is a noticeable drop. This may indicate some overfitting.

  1. RMSprop:
  • Test Performance: Similar to SGD, with a minor drop in precision and recall for class 1 in the test set, which suggests that this class might be a bit more challenging to generalize by this optimizer.
  1. Adam:
  • Consistency: Shows slightly more consistent F1-scores between training and testing than SGD, suggesting better generalization for certain classes.

  • Test Class 4: Notable for achieving 100% across all metrics, indicating exceptional performance on this class.

  1. Nadam:
  • Balanced Performance: Offers good balance with slightly higher test metrics in some classes compared to Adam, especially noticeable in class 0 for test precision and recall.

  • Slight Overfitting: As with others, there's a gap between train and test scores, albeit small.

  1. AdamW:
  • Overall Test Scores: Among the highest, suggesting that this optimizer may provide the best generalization among those tested.

  • Stability: Shows less variation between training and test metrics, particularly in class 4 where it matches or exceeds other optimizers.

Recommendations:¶

  1. Further Investigation: For classes with a significant drop between training and testing (like class 1), it might be beneficial to look into specific features or additional data that can improve model robustness.

  2. Optimizer Choice: AdamW seems to provide the best generalization based on this data. Consider using it for deployment if consistent performance across multiple classes is critical.

  3. Regularization and Tuning: Implement or increase regularization techniques to mitigate overfitting observed particularly in SGD and RMSprop optimizers. Also, tuning hyperparameters specifically for the underperforming classes could yield better results.

Train and Test Confusion Matrices for Base NN Classifier

In [ ]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
# Predict on train and test data for each optimizer
y_pred_train = {}
y_pred_test = {}
for opt, base_nn_model in base_nn_models.items():
    y_pred_train[opt] = np.argmax(base_nn_model.predict(X_train_scaled), axis=1)
    y_pred_test[opt] = np.argmax(base_nn_model.predict(X_test_scaled), axis=1)

# Generate confusion matrices
for opt in optimizers:
    print(f"\nConfusion Matrices for Base NN with {opt} optimizer:")
    cm_train = confusion_matrix(y_train_encoded, y_pred_train[opt])
    cm_test = confusion_matrix(y_test_encoded, y_pred_test[opt])

    fig, axes = plt.subplots(1, 2, figsize=(12, 6))

    # Train Confusion Matrix
    sns.heatmap(cm_train, annot=True, fmt="d", cmap="viridis", square=True, ax=axes[0])
    axes[0].set_title(f"Train Confusion Matrix {opt} optimizer:", fontsize = 10)
    axes[0].set_xlabel("Predicted Labels")
    axes[0].set_ylabel("True Labels")

    # Test Confusion Matrix
    sns.heatmap(cm_test, annot=True, fmt="d", cmap="viridis", square=True, ax=axes[1])
    axes[1].set_title(f"Test Confusion Matrix {opt} optimizer:", fontsize = 10)
    axes[1].set_xlabel("Predicted Labels")
    axes[1].set_ylabel("True Labels")

    # Add space between matrices
    plt.subplots_adjust(wspace=1.5)

    plt.tight_layout()
    plt.show()
    print("\n" * 3)
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step 
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step 
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step 
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step 
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step 

Confusion Matrices for Base NN with SGD optimizer:




Confusion Matrices for Base NN with RMSprop optimizer:




Confusion Matrices for Base NN with Adam optimizer:




Confusion Matrices for Base NN with Nadam optimizer:




Confusion Matrices for Base NN with AdamW optimizer:



Observations:¶

  1. Training Performance:
  • Most optimizers perform well on the training data with high diagonal values, indicating good classification for each label.

  • Minor misclassifications are observed, notably a few instances of class 0 being predicted as class 3 across different optimizers.

  1. Testing Performance:
  • Testing performance slightly decreases, which is typical due to the model facing unseen data.

  • The decrease in performance is not drastic, which indicates good generalization for most optimizers.

  1. Specific Insights:

SGD:

  • Misclassifications are slightly higher in testing, especially between classes 0 and 3.
  • Class 1 (true label) shows strong accuracy with 71 out of 73 correctly predicted on the test set.

RMSprop:

  • Shows increased misclassification between class 0 and other classes in the training set.
  • Test set performance for class 1 is consistent with the training, showing reliable performance.

Adam:

  • Similar patterns in both training and testing sets, with slight misclassifications mainly in classes near to each other (e.g., 1 mispredicted as 0).
  • This optimizer shows relatively balanced performance across all classes.

Nadam:

  • Slightly better at handling class 4 misclassifications compared to others.
  • Noticeably more misclassifications between class 0 and 3 in the test set compared to training.

AdamW:

  • Shows some robustness in dealing with class 3 and 4 but has slight confusion in class 2 predictions.
  • Test results are similar to training, indicating good consistency. # Conclusion:

Specific Class Performance:

  • For Class 1, optimizers like SGD, Adam, and Nadam consistently show high accuracy on the testing dataset.

Misclassification Patterns:

  • Certain patterns like misclassifications between classes 0 and 3 are more prevalent in some optimizers (Nadam, Adam) than others.

Balancing Decision:

  • Consistency between Training and Testing: AdamW shows good consistency between training and testing.
  • Overall Accuracy and Stability: Adam and Nadam also display strong and stable performance, but AdamW seems slightly better in terms of maintaining performance from training to testing.

Hypertuned NN classifer

In [ ]:
from tensorflow.keras.layers import BatchNormalization, Dropout
from tensorflow.keras.regularizers import l2

# Function to build the improved model
def build_ht_nn_model(input_shape, num_classes, optimizer_name):
    # Define the model architecture
    ht_nn_model = Sequential([
        Input(shape=(input_shape,)),
        Dense(256, activation='relu', kernel_regularizer=l2(0.001)),
        BatchNormalization(),
        Dropout(0.2),
        Dense(128, activation='relu', kernel_regularizer=l2(0.001)),
        BatchNormalization(),
        Dropout(0.2),
        Dense(64, activation='relu', kernel_regularizer=l2(0.001)),
        BatchNormalization(),
        Dropout(0.2),
        Dense(32, activation='relu', kernel_regularizer=l2(0.001)),
        BatchNormalization(),
        Dropout(0.2),
        Dense(num_classes, activation='softmax')
    ])

    # Optimizers dictionary
    optimizers = {
        'SGD': SGD(),
        'RMSprop': RMSprop(),
        'Adam': Adam(),
        'Nadam': Nadam(),
        'AdamW': AdamW()
    }

    # Validate optimizer name
    if optimizer_name not in optimizers:
        raise ValueError("Optimizer " + optimizer_name + " is not recognized. Please choose from " + str(list(optimizers.keys())))

    # Compile the model
    ht_nn_model.compile(optimizer=optimizers[optimizer_name], loss='categorical_crossentropy', metrics=['accuracy'])
    return ht_nn_model

# Define number of classes and input shape
num_classes = y_train_onehot.shape[1]
input_shape = X_train_scaled.shape[1]  # GloVe embeddings

# Initialize improved models with different optimizers
ht_nn_models = {}
optimizers = ['SGD', 'RMSprop', 'Adam', 'Nadam', 'AdamW']
for opt in optimizers:
    ht_nn_models[opt] = build_ht_nn_model(input_shape, num_classes, optimizer_name=opt)

print("Hypertuned NN models initialized with different optimizers.")
Hypertuned NN models initialized with different optimizers.
In [ ]:
# Print model summaries for all optimizers
for opt, ht_nn_model in ht_nn_models.items():
    print(f"Hypertuned NN Model with {opt} optimizer:")
    ht_nn_model.summary()
Hypertuned NN Model with SGD optimizer:
Model: "sequential_5"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ dense_15 (Dense)                     │ (None, 256)                 │          92,672 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization                  │ (None, 256)                 │           1,024 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout (Dropout)                    │ (None, 256)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_16 (Dense)                     │ (None, 128)                 │          32,896 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_1                │ (None, 128)                 │             512 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_1 (Dropout)                  │ (None, 128)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_17 (Dense)                     │ (None, 64)                  │           8,256 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_2                │ (None, 64)                  │             256 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_2 (Dropout)                  │ (None, 64)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_18 (Dense)                     │ (None, 32)                  │           2,080 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_3                │ (None, 32)                  │             128 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_3 (Dropout)                  │ (None, 32)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_19 (Dense)                     │ (None, 5)                   │             165 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 137,989 (539.02 KB)
 Trainable params: 137,029 (535.27 KB)
 Non-trainable params: 960 (3.75 KB)
Hypertuned NN Model with RMSprop optimizer:
Model: "sequential_6"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ dense_20 (Dense)                     │ (None, 256)                 │          92,672 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_4                │ (None, 256)                 │           1,024 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_4 (Dropout)                  │ (None, 256)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_21 (Dense)                     │ (None, 128)                 │          32,896 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_5                │ (None, 128)                 │             512 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_5 (Dropout)                  │ (None, 128)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_22 (Dense)                     │ (None, 64)                  │           8,256 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_6                │ (None, 64)                  │             256 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_6 (Dropout)                  │ (None, 64)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_23 (Dense)                     │ (None, 32)                  │           2,080 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_7                │ (None, 32)                  │             128 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_7 (Dropout)                  │ (None, 32)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_24 (Dense)                     │ (None, 5)                   │             165 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 137,989 (539.02 KB)
 Trainable params: 137,029 (535.27 KB)
 Non-trainable params: 960 (3.75 KB)
Hypertuned NN Model with Adam optimizer:
Model: "sequential_7"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ dense_25 (Dense)                     │ (None, 256)                 │          92,672 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_8                │ (None, 256)                 │           1,024 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_8 (Dropout)                  │ (None, 256)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_26 (Dense)                     │ (None, 128)                 │          32,896 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_9                │ (None, 128)                 │             512 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_9 (Dropout)                  │ (None, 128)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_27 (Dense)                     │ (None, 64)                  │           8,256 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_10               │ (None, 64)                  │             256 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_10 (Dropout)                 │ (None, 64)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_28 (Dense)                     │ (None, 32)                  │           2,080 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_11               │ (None, 32)                  │             128 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_11 (Dropout)                 │ (None, 32)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_29 (Dense)                     │ (None, 5)                   │             165 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 137,989 (539.02 KB)
 Trainable params: 137,029 (535.27 KB)
 Non-trainable params: 960 (3.75 KB)
Hypertuned NN Model with Nadam optimizer:
Model: "sequential_8"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ dense_30 (Dense)                     │ (None, 256)                 │          92,672 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_12               │ (None, 256)                 │           1,024 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_12 (Dropout)                 │ (None, 256)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_31 (Dense)                     │ (None, 128)                 │          32,896 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_13               │ (None, 128)                 │             512 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_13 (Dropout)                 │ (None, 128)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_32 (Dense)                     │ (None, 64)                  │           8,256 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_14               │ (None, 64)                  │             256 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_14 (Dropout)                 │ (None, 64)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_33 (Dense)                     │ (None, 32)                  │           2,080 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_15               │ (None, 32)                  │             128 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_15 (Dropout)                 │ (None, 32)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_34 (Dense)                     │ (None, 5)                   │             165 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 137,989 (539.02 KB)
 Trainable params: 137,029 (535.27 KB)
 Non-trainable params: 960 (3.75 KB)
Hypertuned NN Model with AdamW optimizer:
Model: "sequential_9"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ dense_35 (Dense)                     │ (None, 256)                 │          92,672 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_16               │ (None, 256)                 │           1,024 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_16 (Dropout)                 │ (None, 256)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_36 (Dense)                     │ (None, 128)                 │          32,896 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_17               │ (None, 128)                 │             512 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_17 (Dropout)                 │ (None, 128)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_37 (Dense)                     │ (None, 64)                  │           8,256 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_18               │ (None, 64)                  │             256 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_18 (Dropout)                 │ (None, 64)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_38 (Dense)                     │ (None, 32)                  │           2,080 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_19               │ (None, 32)                  │             128 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_19 (Dropout)                 │ (None, 32)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_39 (Dense)                     │ (None, 5)                   │             165 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 137,989 (539.02 KB)
 Trainable params: 137,029 (535.27 KB)
 Non-trainable params: 960 (3.75 KB)
In [ ]:
from keras.callbacks import EarlyStopping
from sklearn.model_selection import KFold

# Early stopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# K-Fold Cross Validation
k = 5  # Number of folds
kf = KFold(n_splits=k, shuffle=True, random_state=42)

# Train and evaluate the improved models with cross-validation
ht_nn_model_history = {}
for model_key, ht_nn_model in ht_nn_models.items():
    print(f"Training Hypertuned NN Classifier model with {model_key}...")
    fold_no = 1
    fold_histories = []
    for train_index, val_index in kf.split(X_train_scaled):
        X_train_fold, X_val_fold = X_train_scaled[train_index], X_train_scaled[val_index]
        y_train_fold, y_val_fold = y_train_onehot[train_index], y_train_onehot[val_index]
        print(f"Training fold {fold_no}...")
        history = ht_nn_model.fit(X_train_fold, y_train_fold, epochs=100, batch_size=32, validation_data=(X_val_fold, y_val_fold), callbacks=[early_stopping], verbose=0)
        fold_histories.append(history)
        print(f"Fold {fold_no} training complete.")
        fold_no += 1

    # Store the history for each model
    ht_nn_model_history[model_key] = fold_histories

    # Evaluate the model on the test set
    loss, accuracy = ht_nn_model.evaluate(X_test_scaled, y_test_onehot, verbose=0)
    print(f"Test Loss (Hypertuned - {model_key}): {loss:.4f}")
    print(f"Test Accuracy (Hypertuned - {model_key}): {accuracy:.4f}")
  # Print early stopping metrics and epoch
    for i, fold_history in enumerate(fold_histories):
        best_val_loss = min(fold_history.history['val_loss'])
        best_val_acc = max(fold_history.history['val_accuracy'])
        early_stopping_epoch = fold_history.epoch[-1]  # Last epoch before early stopping
        print(f"Fold {i+1}: Best Validation Loss: {best_val_loss:.4f}, Best Validation Accuracy: {best_val_acc:.4f}, Early Stopping Epoch: {early_stopping_epoch}")

print("Training and evaluation of Hypertuned NN Classifier model with cross-validation complete.")
Training Hypertuned NN Classifier model with SGD...
Training fold 1...
Fold 1 training complete.
Training fold 2...
Fold 2 training complete.
Training fold 3...
Fold 3 training complete.
Training fold 4...
Fold 4 training complete.
Training fold 5...
Fold 5 training complete.
Test Loss (Hypertuned - SGD): 0.5150
Test Accuracy (Hypertuned - SGD): 0.9644
Fold 1: Best Validation Loss: 0.6325, Best Validation Accuracy: 0.9839, Early Stopping Epoch: 69
Fold 2: Best Validation Loss: 0.5566, Best Validation Accuracy: 1.0000, Early Stopping Epoch: 24
Fold 3: Best Validation Loss: 0.4858, Best Validation Accuracy: 1.0000, Early Stopping Epoch: 99
Fold 4: Best Validation Loss: 0.4623, Best Validation Accuracy: 1.0000, Early Stopping Epoch: 47
Fold 5: Best Validation Loss: 0.4198, Best Validation Accuracy: 0.9960, Early Stopping Epoch: 99
Training Hypertuned NN Classifier model with RMSprop...
Training fold 1...
Fold 1 training complete.
Training fold 2...
Fold 2 training complete.
Training fold 3...
Fold 3 training complete.
Training fold 4...
Fold 4 training complete.
Training fold 5...
Fold 5 training complete.
Test Loss (Hypertuned - RMSprop): 0.2001
Test Accuracy (Hypertuned - RMSprop): 0.9773
Fold 1: Best Validation Loss: 0.2305, Best Validation Accuracy: 0.9758, Early Stopping Epoch: 64
Fold 2: Best Validation Loss: 0.1597, Best Validation Accuracy: 0.9960, Early Stopping Epoch: 25
Fold 3: Best Validation Loss: 0.1063, Best Validation Accuracy: 1.0000, Early Stopping Epoch: 28
Fold 4: Best Validation Loss: 0.1032, Best Validation Accuracy: 1.0000, Early Stopping Epoch: 10
Fold 5: Best Validation Loss: 0.1061, Best Validation Accuracy: 0.9960, Early Stopping Epoch: 20
Training Hypertuned NN Classifier model with Adam...
Training fold 1...
Fold 1 training complete.
Training fold 2...
Fold 2 training complete.
Training fold 3...
Fold 3 training complete.
Training fold 4...
Fold 4 training complete.
Training fold 5...
Fold 5 training complete.
Test Loss (Hypertuned - Adam): 0.3666
Test Accuracy (Hypertuned - Adam): 0.9547
Fold 1: Best Validation Loss: 0.5241, Best Validation Accuracy: 0.9677, Early Stopping Epoch: 51
Fold 2: Best Validation Loss: 0.3636, Best Validation Accuracy: 0.9960, Early Stopping Epoch: 26
Fold 3: Best Validation Loss: 0.2123, Best Validation Accuracy: 1.0000, Early Stopping Epoch: 46
Fold 4: Best Validation Loss: 0.1622, Best Validation Accuracy: 1.0000, Early Stopping Epoch: 43
Fold 5: Best Validation Loss: 0.1700, Best Validation Accuracy: 0.9960, Early Stopping Epoch: 13
Training Hypertuned NN Classifier model with Nadam...
Training fold 1...
Fold 1 training complete.
Training fold 2...
Fold 2 training complete.
Training fold 3...
Fold 3 training complete.
Training fold 4...
Fold 4 training complete.
Training fold 5...
Fold 5 training complete.
Test Loss (Hypertuned - Nadam): 0.2501
Test Accuracy (Hypertuned - Nadam): 0.9709
Fold 1: Best Validation Loss: 0.4392, Best Validation Accuracy: 0.9718, Early Stopping Epoch: 64
Fold 2: Best Validation Loss: 0.2839, Best Validation Accuracy: 0.9960, Early Stopping Epoch: 25
Fold 3: Best Validation Loss: 0.2068, Best Validation Accuracy: 1.0000, Early Stopping Epoch: 36
Fold 4: Best Validation Loss: 0.1694, Best Validation Accuracy: 1.0000, Early Stopping Epoch: 32
Fold 5: Best Validation Loss: 0.1700, Best Validation Accuracy: 0.9960, Early Stopping Epoch: 10
Training Hypertuned NN Classifier model with AdamW...
Training fold 1...
Fold 1 training complete.
Training fold 2...
Fold 2 training complete.
Training fold 3...
Fold 3 training complete.
Training fold 4...
Fold 4 training complete.
Training fold 5...
Fold 5 training complete.
Test Loss (Hypertuned - AdamW): 0.2210
Test Accuracy (Hypertuned - AdamW): 0.9773
Fold 1: Best Validation Loss: 0.4668, Best Validation Accuracy: 0.9758, Early Stopping Epoch: 58
Fold 2: Best Validation Loss: 0.3059, Best Validation Accuracy: 0.9960, Early Stopping Epoch: 28
Fold 3: Best Validation Loss: 0.1961, Best Validation Accuracy: 1.0000, Early Stopping Epoch: 40
Fold 4: Best Validation Loss: 0.1715, Best Validation Accuracy: 1.0000, Early Stopping Epoch: 27
Fold 5: Best Validation Loss: 0.1688, Best Validation Accuracy: 0.9960, Early Stopping Epoch: 15
Training and evaluation of Hypertuned NN Classifier model with cross-validation complete.

Displaying Average Train vs Validation accuracy and Average Train vs Validation loss for Hypertuned NN Classifier

In [ ]:
# Create a dictionary to store the results
results = {}

# Loop through each optimizer and its history
for opt, histories in ht_nn_model_history.items():
    train_acc = []
    val_acc = []
    train_loss = []
    val_loss = []

    # Loop through each fold's history
    for history in histories:
        train_acc.extend(history.history['accuracy'])
        val_acc.extend(history.history['val_accuracy'])
        train_loss.extend(history.history['loss'])
        val_loss.extend(history.history['val_loss'])
# Calculate average values
    avg_train_acc = np.mean(train_acc) * 100
    avg_val_acc = np.mean(val_acc) * 100
    avg_train_loss = np.mean(train_loss)
    avg_val_loss = np.mean(val_loss)

    # Store the results in the dictionary
    results[opt] = {
        'Avg_Train_Accuracy': avg_train_acc,
        'Avg_Val_Accuracy': avg_val_acc,
        'Avg_Train_Loss': avg_train_loss,
        'Avg_Val_Loss': avg_val_loss
    }

# Create a pandas DataFrame from the results
df_results = pd.DataFrame.from_dict(results, orient='index')

# Display the DataFrame
display(df_results)
Avg_Train_Accuracy Avg_Val_Accuracy Avg_Train_Loss Avg_Val_Loss
SGD 98.071551 98.767230 0.566529 0.539812
RMSprop 97.970452 97.085000 0.262341 0.304658
Adam 97.983219 97.674442 0.365288 0.378328
Nadam 97.942916 97.165602 0.362498 0.391174
AdamW 97.961864 97.631997 0.364066 0.376799

Train vs Validation plots for Accuracy and Loss for Hypertuned NN Classifier for all optimisers.

In [ ]:
fig, axes = plt.subplots(len(optimizers), 2, figsize=(15, 5 * len(optimizers)))

for i, opt in enumerate(optimizers):
    # Get the history for the first fold (you can average over folds if needed)
    fold_history = ht_nn_model_history[opt][0]

    # Accuracy plot
    axes[i, 0].plot(fold_history.history['accuracy'], label='Train Accuracy', color='blue')
    axes[i, 0].plot(fold_history.history['val_accuracy'], label='Validation Accuracy', color='green')
    axes[i, 0].set_title(f'Train vs Validation Accuracy (Hypertuned - {opt})')
    axes[i, 0].set_xlabel('Epoch')
    axes[i, 0].set_ylabel('Accuracy')
    axes[i, 0].legend()

    # Loss plot
    axes[i, 1].plot(fold_history.history['loss'], label='Train Loss', color='red')
    axes[i, 1].plot(fold_history.history['val_loss'], label='Validation Loss', color='orange')
    axes[i, 1].set_title(f'Train vs Validation Loss (Hypertuned - {opt})')
    axes[i, 1].set_xlabel('Epoch')
    axes[i, 1].set_ylabel('Loss')
    axes[i, 1].legend()

plt.tight_layout()
plt.show()

Observations:¶

  1. SGD (Stochastic Gradient Descent)
  • Accuracy: The training and validation accuracy curves converge closely, indicating good generalization.

  • Loss: Both training and validation loss decrease sharply and stabilize quickly, showing that SGD is effective and efficient in optimizing the loss function.

  1. RMSprop
  • Accuracy: There's a noticeable gap between training and validation accuracy, suggesting some overfitting. However, the validation accuracy does improve steadily, which is a good sign.

  • Loss: The training and validation loss curves decrease quickly. The small gap between them suggests some level of overfitting, though less severe compared to the other Adam-based optimizers.

  1. Adam
  • Accuracy: The accuracy curves for Adam show that while there's an improvement in validation accuracy, the gap between training and validation accuracy is significant, indicating overfitting.

  • Loss: The loss curves converge well initially but start to diverge slightly, which again points to overfitting as training progresses.

  1. Nadam
  • Accuracy: Similar to Adam, Nadam shows a gap between the training and validation accuracy curves, indicative of overfitting.

  • Loss: The loss curves show less divergence compared to Adam, suggesting a slightly better handling of overfitting.

  • AdamW

  • Accuracy: The gap between training and validation accuracy is somewhat large, which suggests overfitting. The improvement in validation accuracy is slower and less stable compared to the other optimizers.

  • Loss: Similar to the accuracy results, the loss curves show a significant gap, indicating that AdamW might not be as effective in this case.

Insights¶

  1. SGD not only achieves the highest accuracies (98.07% training, 98.76% validation) but also maintains the lowest losses (0.566 training, 0.539 validation), confirming its superiority in both learning and generalizing from the training data.

  2. RMSprop, Adam, and Nadam show a clear gap between training and validation metrics, indicating some degree of overfitting, though RMSprop and Nadam manage slightly better generalization than Adam.

  3. AdamW exhibits the largest gap, suggesting significant overfitting and the need for adjustments in model training strategy or hyperparameter settings.

Classification Reports for Hypertuned NN Classifier.

In [ ]:
# Predict on train and test data for each optimizer
y_pred_train = {}
y_pred_test = {}
for opt, ht_nn_model in ht_nn_models.items():
    y_pred_train[opt] = np.argmax(ht_nn_model.predict(X_train_scaled), axis=1)
    y_pred_test[opt] = np.argmax(ht_nn_model.predict(X_test_scaled), axis=1)

# Generate classification reports
for opt in optimizers:
    print(f"\nClassification Report for Hypertuned Model with {opt} optimizer:")
    train_report = classification_report(y_train_encoded, y_pred_train[opt], output_dict=True)
    test_report = classification_report(y_test_encoded, y_pred_test[opt], output_dict=True)

    # Create DataFrames for better visualization
    train_df = pd.DataFrame(train_report).transpose()
    test_df = pd.DataFrame(test_report).transpose()

    # Rename columns
    train_df.columns = ['Train_' + col for col in train_df.columns]
    test_df.columns = ['Test_' + col for col in test_df.columns]

    # Concatenate DataFrames
    combined_df = pd.concat([train_df, test_df], axis=1)

    # Display the combined report
    display(combined_df)
    print("\n" * 3)
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 21ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step

Classification Report for Hypertuned Model with SGD optimizer:
Train_precision Train_recall Train_f1-score Train_support Test_precision Test_recall Test_f1-score Test_support
0 1.000000 0.996183 0.998088 262.000000 0.862745 0.936170 0.897959 47.000000
1 1.000000 1.000000 1.000000 236.000000 0.972603 0.972603 0.972603 73.000000
2 1.000000 1.000000 1.000000 253.000000 0.982143 0.982143 0.982143 56.000000
3 0.995902 1.000000 0.997947 243.000000 0.983871 0.924242 0.953125 66.000000
4 1.000000 1.000000 1.000000 242.000000 1.000000 1.000000 1.000000 67.000000
accuracy 0.999191 0.999191 0.999191 0.999191 0.964401 0.964401 0.964401 0.964401
macro avg 0.999180 0.999237 0.999207 1236.000000 0.960272 0.963032 0.961166 309.000000
weighted avg 0.999194 0.999191 0.999191 1236.000000 0.965969 0.964401 0.964758 309.000000




Classification Report for Hypertuned Model with RMSprop optimizer:
Train_precision Train_recall Train_f1-score Train_support Test_precision Test_recall Test_f1-score Test_support
0 1.000000 0.992366 0.996169 262.000000 1.000000 0.872340 0.931818 47.000000
1 0.995781 1.000000 0.997886 236.000000 0.973333 1.000000 0.986486 73.000000
2 0.996063 1.000000 0.998028 253.000000 0.949153 1.000000 0.973913 56.000000
3 0.995885 0.995885 0.995885 243.000000 0.970149 0.984848 0.977444 66.000000
4 1.000000 1.000000 1.000000 242.000000 1.000000 1.000000 1.000000 67.000000
accuracy 0.997573 0.997573 0.997573 0.997573 0.977346 0.977346 0.977346 0.977346
macro avg 0.997546 0.997650 0.997593 1236.000000 0.978527 0.971438 0.973932 309.000000
weighted avg 0.997579 0.997573 0.997571 1236.000000 0.978109 0.977346 0.976891 309.000000




Classification Report for Hypertuned Model with Adam optimizer:
Train_precision Train_recall Train_f1-score Train_support Test_precision Test_recall Test_f1-score Test_support
0 1.000000 0.996183 0.998088 262.000000 0.811321 0.914894 0.860000 47.000000
1 0.995781 1.000000 0.997886 236.000000 0.985915 0.958904 0.972222 73.000000
2 1.000000 1.000000 1.000000 253.000000 1.000000 0.946429 0.972477 56.000000
3 0.995885 0.995885 0.995885 243.000000 0.953846 0.939394 0.946565 66.000000
4 1.000000 1.000000 1.000000 242.000000 1.000000 1.000000 1.000000 67.000000
accuracy 0.998382 0.998382 0.998382 0.998382 0.954693 0.954693 0.954693 0.954693
macro avg 0.998333 0.998414 0.998372 1236.000000 0.950216 0.951924 0.950253 309.000000
weighted avg 0.998385 0.998382 0.998382 1236.000000 0.958116 0.954693 0.955742 309.000000




Classification Report for Hypertuned Model with Nadam optimizer:
Train_precision Train_recall Train_f1-score Train_support Test_precision Test_recall Test_f1-score Test_support
0 1.000000 0.988550 0.994242 262.000000 0.953488 0.872340 0.911111 47.000000
1 0.995781 1.000000 0.997886 236.000000 0.972222 0.958904 0.965517 73.000000
2 0.992157 1.000000 0.996063 253.000000 0.949153 1.000000 0.973913 56.000000
3 0.995885 0.995885 0.995885 243.000000 0.970588 1.000000 0.985075 66.000000
4 1.000000 1.000000 1.000000 242.000000 1.000000 1.000000 1.000000 67.000000
accuracy 0.996764 0.996764 0.996764 0.996764 0.970874 0.970874 0.970874 0.970874
macro avg 0.996764 0.996887 0.996815 1236.000000 0.969090 0.966249 0.967123 309.000000
weighted avg 0.996780 0.996764 0.996761 1236.000000 0.970866 0.970874 0.970418 309.000000




Classification Report for Hypertuned Model with AdamW optimizer:
Train_precision Train_recall Train_f1-score Train_support Test_precision Test_recall Test_f1-score Test_support
0 1.000000 0.992366 0.996169 262.000000 0.916667 0.936170 0.926316 47.000000
1 1.000000 1.000000 1.000000 236.000000 1.000000 0.986301 0.993103 73.000000
2 0.996063 1.000000 0.998028 253.000000 0.964912 0.982143 0.973451 56.000000
3 0.995902 1.000000 0.997947 243.000000 0.984615 0.969697 0.977099 66.000000
4 1.000000 1.000000 1.000000 242.000000 1.000000 1.000000 1.000000 67.000000
accuracy 0.998382 0.998382 0.998382 0.998382 0.977346 0.977346 0.977346 0.977346
macro avg 0.998393 0.998473 0.998429 1236.000000 0.973239 0.974862 0.973994 309.000000
weighted avg 0.998388 0.998382 0.998380 1236.000000 0.977680 0.977346 0.977460 309.000000



Observations:¶

  1. High Training Performance: All models exhibit nearly perfect precision, recall, and F1-scores on the training data, indicative of strong fits.

  2. Testing Performance Variation: Testing metrics show considerable variation across optimizers, especially in categories like 0, where precision fluctuates widely.

  3. Consistency in Certain Categories: Categories 1 and 4 consistently achieve near-perfect testing metrics, reflecting well-represented and easily distinguishable features in the dataset.

Optimizer-Specific Observations:¶

  1. SGD Optimizer:
  • Testing Recall: Shows strong recall for most categories (e.g., 97.26% for category 1, 98.21% for category 2).
  • Testing Precision: Lower precision in categories 0 (86.27%) and 3 (90.35%), suggesting potential challenges with generalization.
  1. RMSprop Optimizer:
  • Testing Precision for Category 0: Experiences a significant drop (80.00%), possibly due to overfitting or insufficient feature generalization.
  • Overall Accuracy: Maintains high overall testing accuracy (97.73%).
  1. Adam Optimizer:
  • Balanced Testing Metrics: Delivers balanced performance with high recall (e.g., 93.93% for category 3) and precision (e.g., 100% for category 4).
  • Testing Precision for Category 0: Experiences reduced precision (81.13%), indicating slight overfitting or representation issues.
  1. Nadam Optimizer:
  • Testing Recall and Precision: Achieves strong recall across categories, with a slight dip in precision for category 0 (95.35%) compared to category 4 (100%).
  • Overall Consistency: Marginally better recall than Adam, indicating slight edges in some categories.
  1. AdamW Optimizer:
  • Testing Precision and Recall: Maintains strong testing performance with high recall (98.62%) and precision (97.32%) across most categories. Category 0 Precision: Marginally lower precision (91.67%) compared to Nadam but consistent across other categories.

Recommendations:¶

  1. Model Selection:
  • Adam and Nadam are preferred due to their balanced performance across most categories, particularly for critical classes like 1 and 4.
  • AdamW is a viable option for slightly higher overall performance at the cost of precision in category 0.
  1. Further Tuning:
  • For optimizers like RMSprop and SGD, focus on hyperparameter adjustments or ensemble methods to address specific weak points in categories like 0 and 3.
  • Investigate category 0 representation in the training set for potential under-representation or overlapping features.

Train and Test Confusion Matrices for Hypertuned NN Classifier for all optimisers.

In [ ]:
# Predict on train and test data for each optimizer
y_pred_train = {}
y_pred_test = {}
for opt, ht_nn_model in ht_nn_models.items():
    y_pred_train[opt] = np.argmax(ht_nn_model.predict(X_train_scaled), axis=1)
    y_pred_test[opt] = np.argmax(ht_nn_model.predict(X_test_scaled), axis=1)

# Generate confusion matrices
for opt in optimizers:
    print(f"\nConfusion Matrices for Hypertuned NN with {opt} optimizer:")
    cm_train = confusion_matrix(y_train_encoded, y_pred_train[opt])
    cm_test = confusion_matrix(y_test_encoded, y_pred_test[opt])

    fig, axes = plt.subplots(1, 2, figsize=(12, 6))

    # Train Confusion Matrix
    sns.heatmap(cm_train, annot=True, fmt="d", cmap="Greens", square=True, ax=axes[0])
    axes[0].set_title(f"Train Confusion Matrix (Hypertuned - {opt})", fontsize = 10)
    axes[0].set_xlabel("Predicted Labels")
    axes[0].set_ylabel("True Labels")

    # Test Confusion Matrix
    sns.heatmap(cm_test, annot=True, fmt="d", cmap="Greens", square=True, ax=axes[1])
    axes[1].set_title(f"Test Confusion Matrix (Hypertuned - {opt})", fontsize = 10)
    axes[1].set_xlabel("Predicted Labels")
    axes[1].set_ylabel("True Labels")

    # Add space between matrices
    plt.subplots_adjust(wspace=1.5)

    plt.tight_layout()
    plt.show()
    print("\n" * 3)
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step 
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step 
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step 
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step 
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step 

Confusion Matrices for Hypertuned NN with SGD optimizer:




Confusion Matrices for Hypertuned NN with RMSprop optimizer:




Confusion Matrices for Hypertuned NN with Adam optimizer:




Confusion Matrices for Hypertuned NN with Nadam optimizer:




Confusion Matrices for Hypertuned NN with AdamW optimizer:



Observations:¶

  1. Accuracy Across Optimizers: High Accuracy Levels:
  • All optimizers demonstrate high accuracy on the diagonal (true positives), showing their capability to effectively learn and predict the correct class labels.
  • AdamW and Nadam show high true positive rates across nearly all classes, especially noticeable in hypertuned states.

Common Misclassification Patterns:

  • Misclassifications commonly occur between specific classes (notably classes 0, 2, and 3) across most optimizers, indicating challenges inherent in the data features or similarities between these classes that are not optimizer-specific.
  1. Generalization and Overfitting: Generalization Issues:
  • A noticeable performance drop from training to testing datasets highlights generalization challenges. For instance, Adam often showed better training performance that did not always translate equally to the testing scenarios.

Impact of Hyperparameter Tuning:

  • Hyperparameter tuning tends to improve test set accuracy, as seen with RMSprop and Nadam, suggesting that tuning helps in enhancing model generalization across unseen data.
  1. Consistency and Stability: Training vs. Testing Discrepancy:
  • Most optimizers, including SGD and Adam, exhibit some consistency issues, with models performing exceptionally well in training but less so in testing, indicating potential overfitting. Tuning typically reduces this gap.

Stability Across Classes:

  • While tuning generally enhances stability, it sometimes exacerbates misclassifications in certain classes, such as increased errors for class 1 in the hypertuned AdamW model.
  1. Optimizer-Specific Findings:

SGD:

  • Shows the necessity of careful hyperparameter adjustments to prevent overfitting, as minimal tuning leads to modest improvements.

RMSprop and Nadam:

  • These optimizers benefit significantly from hyperparameter tuning, particularly in adjusting momentum terms that help in navigating the error landscape more effectively.

Adam and AdamW:

Demonstrates flexibility and robust initial performance, with AdamW slightly better at managing long-term stability thanks to effective handling of weight decay.

Recommendations:

  1. Enhanced Parameter Tuning:
  • Explore more adaptive learning rate schedules and advanced regularization techniques to address specific misclassification issues, such as dynamic learning rate adjustments based on validation loss feedback.
  1. Cross-validation Techniques:
  • Employ robust cross-validation to ensure the optimizer's performance is reliable across various subsets of data, enhancing the model’s reliability and predictability.
  1. Detailed Feature Analysis:
  • Analyze features leading to high misclassification rates to refine model inputs, potentially incorporating dimensionality reduction techniques or feature selection algorithms to enhance class separability.
  1. Advanced Optimization Techniques:
  • Experiment with less common optimizers or combinations of multiple optimizers (ensemble methods) to potentially leverage the strengths of various optimization algorithms.

Design, Train and Test RNN or LSTM classifiers

Designing Base RNN Classifier using SimpleRNN

In [ ]:
from tensorflow.keras.layers import LSTM, Dense, Dropout, SimpleRNN
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam, SGD, RMSprop, Nadam, AdamW
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler


def create_rnn_model(optimizer='adam'):
    rnn_model = Sequential()
    rnn_model.add(SimpleRNN(units=32, input_shape=(X_train_scaled.shape[1], 1)))  # Adjust input_shape as needed
    rnn_model.add(Dense(units=y_train_onehot.shape[1], activation='sigmoid'))
    rnn_model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return rnn_model

optimizers = ['sgd', 'rmsprop', 'adam', 'nadam', 'adamw']
rnn_models = {}
rnn_model_history = {}
In [ ]:
for opt in optimizers:
    rnn_models[opt] = create_rnn_model(optimizer=opt)
    print(f"RNN Model with {opt} optimizer:")
    rnn_models[opt].summary()
    rnn_model_history[opt] = rnn_models[opt].fit(X_train_scaled, y_train_onehot, epochs=10, batch_size=32, validation_split=0.2, verbose=0)
/usr/local/lib/python3.10/dist-packages/keras/src/layers/rnn/rnn.py:204: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(**kwargs)
RNN Model with sgd optimizer:
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ simple_rnn (SimpleRNN)               │ (None, 32)                  │           1,088 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense (Dense)                        │ (None, 5)                   │             165 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 1,253 (4.89 KB)
 Trainable params: 1,253 (4.89 KB)
 Non-trainable params: 0 (0.00 B)
RNN Model with rmsprop optimizer:
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ simple_rnn_1 (SimpleRNN)             │ (None, 32)                  │           1,088 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_1 (Dense)                      │ (None, 5)                   │             165 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 1,253 (4.89 KB)
 Trainable params: 1,253 (4.89 KB)
 Non-trainable params: 0 (0.00 B)
RNN Model with adam optimizer:
Model: "sequential_2"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ simple_rnn_2 (SimpleRNN)             │ (None, 32)                  │           1,088 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_2 (Dense)                      │ (None, 5)                   │             165 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 1,253 (4.89 KB)
 Trainable params: 1,253 (4.89 KB)
 Non-trainable params: 0 (0.00 B)
RNN Model with nadam optimizer:
Model: "sequential_3"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ simple_rnn_3 (SimpleRNN)             │ (None, 32)                  │           1,088 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_3 (Dense)                      │ (None, 5)                   │             165 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 1,253 (4.89 KB)
 Trainable params: 1,253 (4.89 KB)
 Non-trainable params: 0 (0.00 B)
RNN Model with adamw optimizer:
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ simple_rnn_4 (SimpleRNN)             │ (None, 32)                  │           1,088 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_4 (Dense)                      │ (None, 5)                   │             165 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 1,253 (4.89 KB)
 Trainable params: 1,253 (4.89 KB)
 Non-trainable params: 0 (0.00 B)
In [ ]:
for opt, rnn_model1 in rnn_models.items():
    print(f"Training model with {opt} optimizer...")
    rnn_model_history[opt] = rnn_model1.fit(X_train_scaled, y_train_onehot, epochs=50, batch_size=32, validation_split=0.2, verbose=0)
    loss, accuracy = rnn_model1.evaluate(X_test_scaled, y_test_onehot, verbose=0)
    print(f"Test Loss ({opt}): {loss:.4f}")
    print(f"Test Accuracy ({opt}): {accuracy:.4f}")

print("Training and evaluation for RNN complete.")
Training model with sgd optimizer...
Test Loss (sgd): 0.9183
Test Accuracy (sgd): 0.6440
Training model with rmsprop optimizer...
Test Loss (rmsprop): 0.9423
Test Accuracy (rmsprop): 0.6731
Training model with adam optimizer...
Test Loss (adam): 0.7714
Test Accuracy (adam): 0.7411
Training model with nadam optimizer...
Test Loss (nadam): 0.7768
Test Accuracy (nadam): 0.7573
Training model with adamw optimizer...
Test Loss (adamw): 0.8634
Test Accuracy (adamw): 0.7152
Training and evaluation for RNN complete.

Train vs Validation plots for Accuracy and Loss for Base RNN Classifier

In [ ]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(len(optimizers), 2, figsize=(15, 5 * len(optimizers)))

for i, opt in enumerate(optimizers):
    # Accuracy plot
    axes[i, 0].plot(rnn_model_history[opt].history['accuracy'], label='Train Accuracy', color='blue')
    axes[i, 0].plot(rnn_model_history[opt].history['val_accuracy'], label='Validation Accuracy', color='green')
    axes[i, 0].set_title(f'Train vs Validation Accuracy ({opt})')
    axes[i, 0].set_xlabel('Epoch')
    axes[i, 0].set_ylabel('Accuracy')
    axes[i, 0].legend()

    # Loss plot
    axes[i, 1].plot(rnn_model_history[opt].history['loss'], label='Train Loss', color='red')
    axes[i, 1].plot(rnn_model_history[opt].history['val_loss'], label='Validation Loss', color='orange')
    axes[i, 1].set_title(f'Train vs Validation Loss ({opt})')
    axes[i, 1].set_xlabel('Epoch')
    axes[i, 1].set_ylabel('Loss')
    axes[i, 1].legend()

plt.tight_layout()
plt.show()

Overall Summary:¶

  1. Training Observations:
    • All models show near-perfect training metrics, suggesting excellent fitting on the training set. However, this could also hint at overfitting if generalization is inadequate on the test data.
  1. Test Accuracy and Loss Trends:

    • Best Accuracy: The Nadam optimizer achieves the highest test accuracy (75.73%), closely followed by Adam (74.11%). Both optimizers also maintain relatively low test loss values, highlighting their effectiveness in balancing training and generalization.

    • Worst Accuracy: SGD performs the poorest (64.40%), coupled with a relatively high test loss (0.9183), indicating challenges in convergence and generalization.

    • Moderate Performance: RMSprop (67.31%) and AdamW (71.52%) perform moderately well, but AdamW struggles slightly with higher test loss (0.8634).
  2. Classification Report Insights:

    • Precision, Recall, and F1-Scores:
    • Adam, Nadam, and AdamW optimizers maintain consistent and high performance across most classes.
    • Lower precision and recall are observed in category 0 across all optimizers, suggesting challenges in correctly identifying or separating this class.

Key Insights:¶

  1. Optimizer Performance Ranking:

    • Top Performers: Nadam and Adam are the most effective optimizers for this task. They achieve a balance between loss minimization and accuracy, along with robust classification metrics.
    • Middle Performers: AdamW shows good classification metrics but slightly lower accuracy compared to Nadam and Adam.
    • Underperformers: SGD and RMSprop struggle with both accuracy and loss, making them less suitable for this task.
  2. Class-Specific Challenges:

    • Lower precision and recall in category 0 across all optimizers suggest that this class might be underrepresented in the dataset or has overlapping features with other classes.
    • Classes 1 and 4 consistently perform well, likely due to better feature representation and higher support.
  3. Loss and Accuracy Correlation:

    • Lower test losses generally correspond to higher test accuracies, as seen with Nadam and Adam, reaffirming their suitability for this task.

Recommendations:¶

  1. Model Selection:

    • Choose Nadam or Adam for final deployment, as they deliver the best test accuracy and classification performance.
  2. Addressing Class Imbalance or Feature Overlap:

    • Investigate and address challenges in category 0 (e.g., underrepresentation or feature overlap). Consider data augmentation or feature engineering to enhance separation for this class.
  1. Further Optimization:

    • Explore fine-tuning hyperparameters for Nadam and Adam to improve performance further, particularly focusing on test loss reduction.
  2. Potential Enhancements:

    • Employ techniques like early stopping or dropout to mitigate overfitting, as perfect training metrics suggest potential overfitting risk.
    • Consider ensembling the best-performing models (e.g., Nadam and Adam) to leverage their strengths.

Classification Reports for Base RNN Classifier.

In [ ]:
# Predict on train and test data for each optimizer
y_pred_train = {}
y_pred_test = {}
for opt, model in rnn_models.items():
    y_pred_train[opt] = np.argmax(model.predict(X_train_scaled), axis=1)
    y_pred_test[opt] = np.argmax(model.predict(X_test_scaled), axis=1)

# Generate classification reports
for opt in optimizers:
    print(f"\nClassification Report for RNN Model with {opt} optimizer:")
    train_report = classification_report(y_train_encoded, y_pred_train[opt], output_dict=True)
    test_report = classification_report(y_test_encoded, y_pred_test[opt], output_dict=True)

    # Create DataFrames for better visualization
    train_df = pd.DataFrame(train_report).transpose()
    test_df = pd.DataFrame(test_report).transpose()

    # Rename columns
    train_df.columns = ['Train_' + col for col in train_df.columns]
    test_df.columns = ['Test_' + col for col in test_df.columns]

    # Concatenate DataFrames
    combined_df = pd.concat([train_df, test_df], axis=1)

    # Display the combined report
    display(combined_df)
    print("\n" * 3)
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 22ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step

Classification Report for RNN Model with sgd optimizer:
Train_precision Train_recall Train_f1-score Train_support Test_precision Test_recall Test_f1-score Test_support
0 0.696486 0.832061 0.758261 262.000000 0.487179 0.808511 0.608000 47.000000
1 0.729592 0.605932 0.662037 236.000000 0.800000 0.547945 0.650407 73.000000
2 0.681648 0.719368 0.700000 253.000000 0.514706 0.625000 0.564516 56.000000
3 0.792627 0.707819 0.747826 243.000000 0.794872 0.469697 0.590476 66.000000
4 0.888889 0.892562 0.890722 242.000000 0.797297 0.880597 0.836879 67.000000
accuracy 0.753236 0.753236 0.753236 0.753236 0.656958 0.656958 0.656958 0.656958
macro avg 0.757848 0.751548 0.751769 1236.000000 0.678811 0.666350 0.650056 309.000000
weighted avg 0.756342 0.753236 0.751846 1236.000000 0.699034 0.656958 0.656022 309.000000




Classification Report for RNN Model with rmsprop optimizer:
Train_precision Train_recall Train_f1-score Train_support Test_precision Test_recall Test_f1-score Test_support
0 0.740964 0.938931 0.828283 262.000000 0.547945 0.851064 0.666667 47.000000
1 0.826087 0.644068 0.723810 236.000000 0.745763 0.602740 0.666667 73.000000
2 0.793388 0.758893 0.775758 253.000000 0.705882 0.642857 0.672897 56.000000
3 0.822222 0.761317 0.790598 243.000000 0.765957 0.545455 0.637168 66.000000
4 0.928854 0.971074 0.949495 242.000000 0.810127 0.955224 0.876712 67.000000
accuracy 0.817152 0.817152 0.817152 0.817152 0.711974 0.711974 0.711974 0.711974
macro avg 0.822303 0.814857 0.813589 1236.000000 0.715135 0.719468 0.704022 309.000000
weighted avg 0.820711 0.817152 0.813907 1236.000000 0.726716 0.711974 0.707039 309.000000




Classification Report for RNN Model with adam optimizer:
Train_precision Train_recall Train_f1-score Train_support Test_precision Test_recall Test_f1-score Test_support
0 0.729483 0.916031 0.812183 262.000000 0.500000 0.893617 0.641221 47.000000
1 0.630769 0.521186 0.570766 236.000000 0.566038 0.410959 0.476190 73.000000
2 0.677824 0.640316 0.658537 253.000000 0.490909 0.482143 0.486486 56.000000
3 0.651163 0.691358 0.670659 243.000000 0.606557 0.560606 0.582677 66.000000
4 0.925581 0.822314 0.870897 242.000000 0.928571 0.776119 0.845528 67.000000
accuracy 0.721683 0.721683 0.721683 0.721683 0.608414 0.608414 0.608414 0.608414
macro avg 0.722964 0.718241 0.716608 1236.000000 0.618415 0.624689 0.606421 309.000000
weighted avg 0.723057 0.721683 0.718309 1236.000000 0.629640 0.608414 0.605986 309.000000




Classification Report for RNN Model with nadam optimizer:
Train_precision Train_recall Train_f1-score Train_support Test_precision Test_recall Test_f1-score Test_support
0 0.782007 0.862595 0.820327 262.000000 0.620690 0.765957 0.685714 47.000000
1 0.827273 0.771186 0.798246 236.000000 0.636364 0.671233 0.653333 73.000000
2 0.833333 0.869565 0.851064 253.000000 0.730769 0.678571 0.703704 56.000000
3 0.847534 0.777778 0.811159 243.000000 0.687500 0.500000 0.578947 66.000000
4 0.916667 0.909091 0.912863 242.000000 0.810811 0.895522 0.851064 67.000000
accuracy 0.838997 0.838997 0.838997 0.838997 0.699029 0.699029 0.699029 0.699029
macro avg 0.841363 0.838043 0.838732 1236.000000 0.697227 0.702257 0.694553 309.000000
weighted avg 0.840404 0.838997 0.838718 1236.000000 0.699836 0.699029 0.694373 309.000000




Classification Report for RNN Model with adamw optimizer:
Train_precision Train_recall Train_f1-score Train_support Test_precision Test_recall Test_f1-score Test_support
0 0.769784 0.816794 0.792593 262.00000 0.514286 0.765957 0.615385 47.000000
1 0.854369 0.745763 0.796380 236.00000 0.849057 0.616438 0.714286 73.000000
2 0.766798 0.766798 0.766798 253.00000 0.692308 0.642857 0.666667 56.000000
3 0.816000 0.839506 0.827586 243.00000 0.718750 0.696970 0.707692 66.000000
4 0.943775 0.971074 0.957230 242.00000 0.928571 0.970149 0.948905 67.000000
accuracy 0.827670 0.827670 0.827670 0.82767 0.737864 0.737864 0.737864 0.737864
macro avg 0.830145 0.827987 0.828117 1236.00000 0.740594 0.738474 0.730587 309.000000
weighted avg 0.828476 0.827670 0.827151 1236.00000 0.759138 0.737864 0.740076 309.000000



Impact of Optimizers:

The code aims to compare how these optimizers influence the final model accuracy and performance metrics. Each optimizer updates the model's weights during training to minimize the loss function, but they do so in different ways.

  • Adam: Generally a good default choice, Adam combines the benefits of other optimizers like Momentum and RMSprop. It adapts the learning rate for each parameter individually. The results often indicate strong performance across most classes.

  • SGD (Stochastic Gradient Descent): The most basic optimizer. It updates weights based on the gradient of the loss function calculated from a single random sample (or a small batch). It may require careful tuning of the learning rate and momentum to achieve good performance. Results might show instability or overfitting.

  • RMSprop: Another adaptive learning rate optimization algorithm that divides the learning rate by an exponentially decaying average of squared gradients. It helps mitigate issues with oscillating gradients. Often good at avoiding local minima, but sometimes it lacks precision for certain categories.

  • Nadam: Combines Adam and Nesterov Momentum. Nesterov Momentum looks ahead to where the parameter will be in the next step, and it adjusts the updates accordingly. This can improve learning speed, especially for RNNs where time-dependency matters.

Analyzing the Results:

The classification reports, presented as combined train/test dataframes, allow you to compare each optimizer’s strengths and weaknesses across various classes. Look for these patterns in your output:

  1. Overfitting: Compare train and test scores. A significant difference (high training accuracy but low testing accuracy) suggests overfitting to training data.

  2. Class-specific Performance: Assess which optimizer excels in different classes. This helps understand which optimizers have more difficulties handling specific aspects of the dataset.

  3. Macro and Weighted Averages: These provide overall performance insights. Look for a good balance between precision and recall and examine if class imbalance is affecting the weighted average scores.

In summary: The code tries to identify which optimizer leads to the best overall performance by considering training and testing classification reports. The analysis will highlight the relative strengths and weaknesses of the different optimizers in this context.

Adam consistently provides the highest F1-scores, accuracy, and a reasonable precision/recall balance for most classes on the test set. Additionally, the confusion matrices for Adam after 50 epochs show minimal misclassifications. Then Adam might be the best choice.

Train and Test Confusion Matrices for RNN Classifier for all optimisers.

In [ ]:
y_pred_train = {}
y_pred_test = {}
for opt, rnn_model in rnn_models.items():
    y_pred_train[opt] = np.argmax(rnn_model.predict(X_train_scaled), axis=1)
    y_pred_test[opt] = np.argmax(rnn_model.predict(X_test_scaled), axis=1)

# Generate confusion matrices
for opt in optimizers:
    print(f"\nConfusion Matrices for Base RNN with {opt} optimizer:")
    cm_train = confusion_matrix(y_train_encoded, y_pred_train[opt])
    cm_test = confusion_matrix(y_test_encoded, y_pred_test[opt])

    fig, axes = plt.subplots(1, 2, figsize=(12, 6))

    # Train Confusion Matrix
    sns.heatmap(cm_train, annot=True, fmt="d", cmap="Greens", square=True, ax=axes[0])
    axes[0].set_title(f"Train Confusion Matrix - {opt}", fontsize = 10)
    axes[0].set_xlabel("Predicted Labels")
    axes[0].set_ylabel("True Labels")

    # Test Confusion Matrix
    sns.heatmap(cm_test, annot=True, fmt="d", cmap="Greens", square=True, ax=axes[1])
    axes[1].set_title(f"Test Confusion Matrix - {opt}", fontsize = 10)
    axes[1].set_xlabel("Predicted Labels")
    axes[1].set_ylabel("True Labels")

    # Add space between matrices
    plt.subplots_adjust(wspace=1.5)

    plt.tight_layout()
    plt.show()
    print("\n" * 3)
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 15ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 18ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step

Confusion Matrices for Base RNN with sgd optimizer:




Confusion Matrices for Base RNN with rmsprop optimizer:




Confusion Matrices for Base RNN with adam optimizer:




Confusion Matrices for Base RNN with nadam optimizer:




Confusion Matrices for Base RNN with adamw optimizer:



HyperTuned RNN Classifier

In [ ]:
!pip install scikeras
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import RandomizedSearchCV
import warnings
warnings.filterwarnings("ignore")


X_train_reshaped = X_train_scaled.reshape(X_train_scaled.shape[0], 1, X_train_scaled.shape[1])
X_test_reshaped = X_test_scaled.reshape(X_test_scaled.shape[0], 1, X_test_scaled.shape[1])


def create_rnn_model(units=32, activation='tanh', optimizer='adam'):
    model = Sequential()
    model.add(SimpleRNN(units=units, input_shape=(X_train_reshaped.shape[1], X_train_reshaped.shape[2]), activation=activation))
    model.add(Dense(units=y_train_onehot.shape[1], activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

# Wrap Keras model with KerasClassifier
rnn_clf = KerasClassifier(
    model=create_rnn_model,
    verbose=0,
    batch_size=32,
    epochs=10
)

# Adjust Hyperparameters
param_dist = {
    'model__units': [16, 32, 64, 128],
    'model__activation': ['relu', 'tanh'],
    'model__optimizer': ['adam', 'rmsprop', 'nadam'],
    'batch_size': [16, 32, 64],
    'epochs': [10, 20, 30]
}

random_search = RandomizedSearchCV(
    estimator=rnn_clf, param_distributions=param_dist,
    n_iter=10, cv=3, verbose=1, n_jobs=-1, error_score='raise'
)

# Fit the model using RandomizedSearchCV
random_search_result = random_search.fit(X_train_reshaped, y_train_onehot)
best_hyperparameters = random_search_result.best_params_

print("Best Hyperparameters:", random_search_result.best_params_)
print("Best Cross-Validation Accuracy:", random_search_result.best_score_)
Collecting scikeras
  Downloading scikeras-0.13.0-py3-none-any.whl.metadata (3.1 kB)
Requirement already satisfied: keras>=3.2.0 in /usr/local/lib/python3.10/dist-packages (from scikeras) (3.5.0)
Requirement already satisfied: scikit-learn>=1.4.2 in /usr/local/lib/python3.10/dist-packages (from scikeras) (1.5.2)
Requirement already satisfied: absl-py in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (1.4.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (1.26.4)
Requirement already satisfied: rich in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (13.9.4)
Requirement already satisfied: namex in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (0.0.8)
Requirement already satisfied: h5py in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (3.12.1)
Requirement already satisfied: optree in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (0.13.1)
Requirement already satisfied: ml-dtypes in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (0.4.1)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (24.2)
Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.4.2->scikeras) (1.13.1)
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.4.2->scikeras) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.4.2->scikeras) (3.5.0)
Requirement already satisfied: typing-extensions>=4.5.0 in /usr/local/lib/python3.10/dist-packages (from optree->keras>=3.2.0->scikeras) (4.12.2)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich->keras>=3.2.0->scikeras) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich->keras>=3.2.0->scikeras) (2.18.0)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich->keras>=3.2.0->scikeras) (0.1.2)
Downloading scikeras-0.13.0-py3-none-any.whl (26 kB)
Installing collected packages: scikeras
Successfully installed scikeras-0.13.0
Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best Hyperparameters: {'model__units': 128, 'model__optimizer': 'rmsprop', 'model__activation': 'relu', 'epochs': 30, 'batch_size': 32}
Best Cross-Validation Accuracy: 0.9538834951456311

Train vs Validation plots for Accuracy and Loss for HyperTuned RNN Classifier

In [ ]:
y_train_categorical = to_categorical(y_train_encoded)
y_test_categorical = to_categorical(y_test_encoded)

best_hyperparameters = random_search_result.best_params_
best_model = create_rnn_model(
    units=best_hyperparameters['model__units'],
    activation=best_hyperparameters['model__activation'],
    optimizer=best_hyperparameters['model__optimizer']
)

# Train the best model
hp_history = best_model.fit(X_train_reshaped, y_train_onehot, epochs=best_hyperparameters['epochs'], batch_size=best_hyperparameters['batch_size'],validation_split=0.2, verbose=0)



'''
hp_history = best_model.fit(X_train_reshaped, y_train_categorical,
                          epochs=50,
                          batch_size=32,
                          validation_data=(X_test_reshaped, y_test_categorical))
'''


# Evaluate the model on the test set
loss, accuracy = best_model.evaluate(X_test_reshaped, y_test_categorical)
print("Test Loss:", loss)
print("Test Accuracy:", accuracy)

# Make predictions on the test set
y_pred_prob = best_model.predict(X_test_reshaped)
y_pred = np.argmax(y_pred_prob, axis=1)  # Get the predicted class labels

# Decode the predicted labels back to original
y_pred_decoded = label_encoder.inverse_transform(y_pred)

# Print some predictions
print("Predicted labels:", y_pred_decoded)
print("True labels:", label_encoder.inverse_transform(y_test_encoded))

# Plot training & validation accuracy values
import matplotlib.pyplot as plt

# Plot training & validation accuracy values
plt.figure(figsize=(12, 5))

# Accuracy Plot
plt.subplot(1, 2, 1)
plt.plot(hp_history.history['accuracy'], label='Train Accuracy')
plt.plot(hp_history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Training vs Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.grid()

# Loss Plot
plt.subplot(1, 2, 2)
plt.plot(hp_history.history['loss'], label='Train Loss')
plt.plot(hp_history.history['val_loss'], label='Validation Loss')
plt.title('Training vs Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.grid()

# Show the plots
plt.tight_layout()
plt.show()
10/10 ━━━━━━━━━━━━━━━━━━━━ 1s 59ms/step - accuracy: 0.9708 - loss: 0.0979
Test Loss: 0.12484573572874069
Test Accuracy: 0.9644013047218323
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 28ms/step
Predicted labels: [4 1 4 1 3 2 4 4 3 3 1 4 4 0 0 3 0 2 1 2 1 1 1 0 1 1 3 2 2 1 3 4 2 4 0 2 4
 1 4 1 1 2 0 1 4 0 2 4 3 2 1 1 3 3 0 1 1 1 3 3 2 2 1 3 0 1 0 2 4 2 2 0 1 4
 0 0 3 3 4 3 3 0 3 2 0 4 2 1 2 3 4 3 4 4 4 1 4 3 2 2 2 4 1 4 4 2 4 2 3 1 1
 1 3 3 1 1 1 1 3 4 0 1 0 4 2 4 4 2 3 3 1 1 3 1 0 4 3 3 3 3 4 2 3 0 0 0 2 4
 2 3 4 0 1 3 2 0 3 4 1 3 4 4 1 2 4 1 0 4 2 3 3 3 1 4 1 0 3 4 4 2 3 2 3 4 3
 1 2 1 0 3 3 3 4 2 3 4 3 2 1 2 1 1 0 0 3 3 1 0 3 2 2 0 0 2 0 2 4 1 4 1 4 0
 3 4 3 2 4 1 3 1 1 0 0 4 2 2 3 4 1 0 1 4 4 2 4 1 4 0 3 1 0 2 1 3 4 0 2 3 2
 2 0 1 1 0 4 4 4 1 3 4 3 1 1 3 1 2 2 1 0 3 1 4 0 1 3 0 3 2 4 1 4 4 4 2 3 0
 0 1 4 0 2 4 1 2 3 2 4 2 2]
True labels: [4 1 4 1 3 2 4 4 3 3 1 4 4 0 0 3 0 2 1 2 1 1 1 0 1 1 3 2 2 1 3 4 2 4 0 2 4
 1 4 1 1 2 0 1 4 0 2 4 3 2 1 1 3 3 0 1 1 1 3 3 2 2 1 0 0 1 0 2 4 2 2 0 1 4
 0 3 3 3 4 3 3 0 3 0 0 4 2 1 2 3 4 3 4 4 4 1 4 3 2 2 0 4 1 4 4 2 4 2 3 1 1
 1 3 3 1 1 1 1 3 4 0 1 0 4 2 4 4 2 3 3 1 1 3 1 3 4 3 3 3 3 4 2 3 0 0 0 2 4
 2 3 4 0 1 3 2 0 3 4 1 3 0 4 1 2 4 1 0 4 2 3 3 3 1 4 1 0 3 4 4 2 3 2 3 4 3
 1 2 1 0 3 3 3 4 2 0 4 1 2 1 2 1 1 0 0 3 3 1 0 3 2 2 0 0 2 0 2 4 1 4 1 4 0
 3 4 3 2 4 1 3 1 1 0 0 4 2 2 3 4 1 0 1 4 4 2 4 1 4 1 3 1 0 2 1 3 4 0 2 3 2
 2 1 1 1 0 4 4 4 1 3 4 3 1 1 3 1 2 2 1 3 3 1 4 0 1 3 0 3 2 4 1 4 4 4 2 3 0
 0 1 4 0 2 4 1 2 3 2 4 2 2]

1. Accuracy Plot (Left Panel):

  • Training Accuracy: The blue line shows that the training accuracy quickly increases and stabilizes near 1.0 by around the 5th epoch.

  • Validation Accuracy: The orange line shows that validation accuracy also improves but stabilizes around 96%, with some minor fluctuations after the 10th epoch.

2. Loss Plot (Right Panel):

  • Training Loss: The blue line demonstrates a steep decline early on and flattens out near 0 around the 10th epoch.
  • Validation Loss: The orange line decreases significantly during the initial epochs but stabilizes and shows minor increases from around epoch 15 onward, suggesting possible overfitting.

Classification Reports for HyperTuned RNN Classifier.

In [ ]:
y_pred_test = np.argmax(best_model.predict(X_test_reshaped), axis=1)
y_pred_train = np.argmax(best_model.predict(X_train_reshaped), axis=1)

print(f"\nClassification Report for Hyperparameter tuned RNN model:")
test_report = classification_report(y_test_encoded, y_pred_test, output_dict=True)
train_report = classification_report(y_train_encoded, y_pred_train, output_dict=True)


# Create DataFrames for better visualization
train_df = pd.DataFrame(train_report).transpose()
test_df = pd.DataFrame(test_report).transpose()

# Rename columns
train_df.columns = ['Train_' + col for col in train_df.columns]
test_df.columns = ['Test_' + col for col in test_df.columns]

# Concatenate DataFrames
combined_df = pd.concat([train_df, test_df], axis=1)

# Display the combined report
display(combined_df)
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 

Classification Report for Hyperparameter tuned RNN model:
Train_precision Train_recall Train_f1-score Train_support Test_precision Test_recall Test_f1-score Test_support
0 0.984791 0.988550 0.986667 262.0000 0.893617 0.893617 0.893617 47.000000
1 0.995708 0.983051 0.989339 236.0000 1.000000 0.958904 0.979021 73.000000
2 0.992095 0.992095 0.992095 253.0000 0.965517 1.000000 0.982456 56.000000
3 0.983673 0.991770 0.987705 243.0000 0.954545 0.954545 0.954545 66.000000
4 1.000000 1.000000 1.000000 242.0000 0.985294 1.000000 0.992593 67.000000
accuracy 0.991100 0.991100 0.991100 0.9911 0.964401 0.964401 0.964401 0.964401
macro avg 0.991253 0.991093 0.991161 1236.0000 0.959795 0.961413 0.960446 309.000000
weighted avg 0.991129 0.991100 0.991103 1236.0000 0.964672 0.964401 0.964368 309.000000

1. Training Set Metrics:

  • Precision: Ranges from 0.98 to 1.00, indicating that most of the predicted positive cases are correct.
  • Recall: Also high, between 0.98 and 1.00, meaning the model successfully identifies most actual positives.
  • F1-Score: All values are around 0.98-1.00, confirming a well-balanced performance between precision and recall for all classes.
  • Accuracy: Overall training accuracy is 99.11%, indicating a strong fit on the training data.

2. Test Set Metrics:

  • Precision: Class 1 and Class 4 show excellent precision at 1.00 and 0.99, suggesting near-perfect classification. Class 0 shows the lowest precision at 0.89, indicating some false positives.
  • Recall: Class 2 and Class 4 have perfect recall (1.00), meaning the model captures all actual positives. Class 1 and Class 3 have slightly lower recall (0.95), meaning a few actual positives are missed.
  • F1-Score: Class 0 has the lowest F1-score at 0.89, showing that it may struggle with both false positives and false negatives. Other classes perform well, with F1-scores ranging from 0.95 to 0.99, indicating balanced performance.

3. Accuracy:

  • The overall test accuracy is 96.44%, which is strong but lower than the training accuracy, suggesting the model generalizes well but shows minor overfitting signs.

Macro Average vs. Weighted Average:

  • Macro Average: Precision: 0.96, Recall: 0.96, F1-Score: 0.96. Reflects that the model performs consistently across all classes without considering class support.
  • Weighted Average: Precision: 0.96, Recall: 0.96, F1-Score: 0.96. The class support is considered, meaning performance is well-balanced, even with varying class distributions.

Train and Test Confusion Matrices for Hypertuned RNN Classifier.

In [ ]:
#Generate Confusion Matrix
cm_test = confusion_matrix(y_test_encoded, y_pred_test)
cm_train = confusion_matrix(y_train_encoded, y_pred_train)

fig, axes = plt.subplots(1, 2, figsize=(12, 6))

# Train Confusion Matrix
sns.heatmap(cm_train, annot=True, fmt="d", cmap="Greens", square=True, ax=axes[0])
axes[0].set_title(f"Train Confusion Matrix for Hyperparameter tuned RNN model", fontsize = 10)
axes[0].set_xlabel("Predicted Labels")
axes[0].set_ylabel("True Labels")

# Test Confusion Matrix
sns.heatmap(cm_test, annot=True, fmt="d", cmap="Greens", square=True, ax=axes[1])
axes[1].set_title(f"Test Confusion Matrix for Hyperparameter tuned RNN Model", fontsize = 10)
axes[1].set_xlabel("Predicted Labels")
axes[1].set_ylabel("True Labels")

# Add space between matrices
plt.subplots_adjust(wspace=1.5)

plt.tight_layout()
plt.show()

Detailed Insights from the Confusion Matrices:

1. Train Confusion Matrix (Left Panel):

  • Class 0: 259 correctly classified, 3 misclassified (1 as Class 1, 1 as Class 2, and 1 as Class 3). Very few misclassifications, showing strong performance on this class.
  • Class 1: 232 correctly classified, 4 misclassified (1 as Class 0, 3 as Class 3). A small number of misclassifications, mostly confused with Class 3.
  • Class 2: 251 correctly classified, 2 misclassified as Class 0. High precision and recall due to minimal misclassification.
  • Class 3: 241 correctly classified, 2 misclassified (1 as Class 0 and 1 as Class 2). Almost perfect classification performance.
  • Class 4: 242 correctly classified, no misclassifications. Perfect classification for Class 4 on the training set.

2. Test Confusion Matrix (Right Panel):

  • Class 0: 42 correctly classified, 5 misclassified (2 as Class 2, 2 as Class 3, and 1 as Class 4). Some confusion with Classes 2, 3, and 4, which may indicate overlapping feature space.
  • Class 1: 70 correctly classified, 3 misclassified (2 as Class 0 and 1 as Class 4). Precision is high, though there's minor confusion with Class 0.
  • Class 2: 56 correctly classified, no misclassifications. Perfect classification in the test set for Class 2.
  • Class 3: 63 correctly classified, 3 misclassified as Class 0. Slight confusion with Class 0 but overall strong performance.
  • Class 4: 67 correctly classified, no misclassifications. Perfect precision and recall for this class in the test set.

Key Observations:

High Performance Overall:

  • Both confusion matrices show strong classification, with the majority of predictions being correct across all classes.
  • Class 0 Challenges: There are minor issues with Class 0 in the test set, where it is misclassified as Class 2, 3, or 4, leading to lower precision.
  • Perfect Classification for Class 2 and 4 in Test: Both Class 2 and Class 4 exhibit perfect classification in the test set, indicating the model’s strength in handling these classes.

Recommendations:

Improve Class 0 Handling:

  • Consider additional feature engineering or rebalancing strategies. Focus on improving class separability in feature space.

Regularization & Augmentation:

  • Apply regularization techniques to mitigate overfitting tendencies seen in training.
  • Data augmentation can help generalize further and reduce minor errors in Classes 0 and 1.

Base LSTM Classifier¶

In [ ]:
df = pd.read_csv('/content/drive/MyDrive/AIML_Capstone_Project/Final_NLP_Glove_df.csv')
In [ ]:
df.head()
Out[ ]:
WeekofYear Weekend GloVe_0 GloVe_1 GloVe_2 GloVe_3 GloVe_4 GloVe_5 GloVe_6 GloVe_7 ... Weekday_Monday Weekday_Saturday Weekday_Sunday Weekday_Thursday Weekday_Tuesday Weekday_Wednesday Season_Spring Season_Summer Season_Winter Accident Level
0 53 0 0.078223 0.040773 -0.041107 -0.293287 -0.148195 -0.085006 0.120392 -0.043692 ... 0 0 0 0 0 0 0 1 0 0
1 53 1 -0.047137 0.109611 -0.049147 -0.199018 0.049427 -0.139335 0.039627 -0.095639 ... 0 1 0 0 0 0 0 1 0 0
2 1 0 -0.057290 0.202640 -0.209550 -0.169683 -0.027187 -0.091942 -0.168629 -0.005628 ... 0 0 0 0 0 1 0 1 0 0
3 1 0 -0.033755 0.019709 -0.029097 -0.216930 -0.088179 -0.137728 -0.017687 0.012178 ... 0 0 0 0 0 0 0 1 0 0
4 1 1 -0.099598 0.082313 -0.132139 -0.090341 -0.122124 -0.055800 0.132037 0.086205 ... 0 0 1 0 0 0 0 1 0 3

5 rows × 362 columns

In [ ]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping

# Features and target
X = df.drop('Accident Level', axis=1).values
y = df['Accident Level'].values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Encode the target variable
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)
y_train_categorical = to_categorical(y_train_encoded)
y_test_categorical = to_categorical(y_test_encoded)

# Reshape the data for LSTM input (samples, time steps, features)
# Assuming a single time step for simplicity
X_train_reshaped = X_train_scaled.reshape(X_train_scaled.shape[0], 1, X_train_scaled.shape[1])
X_test_reshaped = X_test_scaled.reshape(X_test_scaled.shape[0], 1, X_test_scaled.shape[1])

# Define the LSTM model
model = Sequential()
model.add(LSTM(64, input_shape=(X_train_reshaped.shape[1], X_train_reshaped.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(32))
model.add(Dropout(0.2))
model.add(Dense(y_train_categorical.shape[1], activation='softmax'))

# Compile the model
#model.compile(optimizer=Adam(lr=0.001), loss='categorical_crossentropy', metrics=['accuracy'])
model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])
# Set up EarlyStopping
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Train the model with EarlyStopping
epochs = 50
batch_size = 64
history = model.fit(X_train_reshaped, y_train_categorical,
                    epochs=epochs,
                    batch_size=batch_size,
                    validation_data=(X_test_reshaped, y_test_categorical),
                    callbacks=[early_stopping])

# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test_reshaped, y_test_categorical)
print("Test Loss:", loss)
print("Test Accuracy:", accuracy)

# Make predictions on the test set
y_pred_prob = model.predict(X_test_reshaped)
y_pred = np.argmax(y_pred_prob, axis=1)  # Get the predicted class labels

# Decode the predicted labels back to original
y_pred_decoded = label_encoder.inverse_transform(y_pred)

# Print some predictions
print("Predicted labels:", y_pred_decoded)
print("True labels:", label_encoder.inverse_transform(y_test_encoded))
Epoch 1/50
/usr/local/lib/python3.10/dist-packages/keras/src/layers/rnn/rnn.py:200: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(**kwargs)
20/20 ━━━━━━━━━━━━━━━━━━━━ 6s 26ms/step - accuracy: 0.4192 - loss: 1.5704 - val_accuracy: 0.7929 - val_loss: 1.4251
Epoch 2/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 2s 11ms/step - accuracy: 0.8051 - loss: 1.3621 - val_accuracy: 0.8447 - val_loss: 1.1690
Epoch 3/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.8797 - loss: 1.1078 - val_accuracy: 0.8803 - val_loss: 0.9099
Epoch 4/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.9231 - loss: 0.8278 - val_accuracy: 0.9320 - val_loss: 0.6672
Epoch 5/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.9502 - loss: 0.5974 - val_accuracy: 0.9450 - val_loss: 0.4542
Epoch 6/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.9846 - loss: 0.3719 - val_accuracy: 0.9644 - val_loss: 0.3078
Epoch 7/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.9952 - loss: 0.2411 - val_accuracy: 0.9709 - val_loss: 0.2196
Epoch 8/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 0.9923 - loss: 0.1492 - val_accuracy: 0.9709 - val_loss: 0.1694
Epoch 9/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.9958 - loss: 0.1057 - val_accuracy: 0.9709 - val_loss: 0.1400
Epoch 10/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 1s 15ms/step - accuracy: 0.9969 - loss: 0.0796 - val_accuracy: 0.9676 - val_loss: 0.1289
Epoch 11/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - accuracy: 0.9984 - loss: 0.0649 - val_accuracy: 0.9709 - val_loss: 0.1155
Epoch 12/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step - accuracy: 0.9962 - loss: 0.0513 - val_accuracy: 0.9676 - val_loss: 0.1132
Epoch 13/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 0.9984 - loss: 0.0402 - val_accuracy: 0.9644 - val_loss: 0.1081
Epoch 14/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 1s 17ms/step - accuracy: 0.9985 - loss: 0.0317 - val_accuracy: 0.9676 - val_loss: 0.1042
Epoch 15/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - accuracy: 0.9954 - loss: 0.0351 - val_accuracy: 0.9644 - val_loss: 0.1045
Epoch 16/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step - accuracy: 0.9955 - loss: 0.0271 - val_accuracy: 0.9644 - val_loss: 0.1032
Epoch 17/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9950 - loss: 0.0275 - val_accuracy: 0.9709 - val_loss: 0.1015
Epoch 18/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.9967 - loss: 0.0246 - val_accuracy: 0.9644 - val_loss: 0.1001
Epoch 19/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.9963 - loss: 0.0200 - val_accuracy: 0.9644 - val_loss: 0.1050
Epoch 20/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.9971 - loss: 0.0191 - val_accuracy: 0.9644 - val_loss: 0.1046
Epoch 21/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.9935 - loss: 0.0253 - val_accuracy: 0.9644 - val_loss: 0.1037
Epoch 22/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.9988 - loss: 0.0144 - val_accuracy: 0.9676 - val_loss: 0.0972
Epoch 23/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.9978 - loss: 0.0170 - val_accuracy: 0.9644 - val_loss: 0.1032
Epoch 24/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.9980 - loss: 0.0147 - val_accuracy: 0.9644 - val_loss: 0.1060
Epoch 25/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9981 - loss: 0.0122 - val_accuracy: 0.9644 - val_loss: 0.1056
Epoch 26/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9978 - loss: 0.0131 - val_accuracy: 0.9644 - val_loss: 0.1057
Epoch 27/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.9987 - loss: 0.0090 - val_accuracy: 0.9644 - val_loss: 0.1033
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9719 - loss: 0.0920 
Test Loss: 0.09718480706214905
Test Accuracy: 0.9676375389099121
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step
Predicted labels: [4 1 4 1 3 2 4 4 3 3 1 4 4 0 0 3 0 2 1 2 1 1 1 0 1 1 3 2 2 1 3 4 2 4 0 2 4
 1 4 1 1 2 0 1 4 0 2 4 3 2 1 1 3 3 0 1 1 1 3 3 2 2 1 3 0 1 0 2 4 2 2 0 1 4
 0 0 3 3 4 3 3 0 3 2 0 4 2 1 2 3 4 3 4 4 4 1 4 3 2 2 3 4 1 4 4 2 4 2 3 1 1
 1 3 3 1 1 1 1 3 4 0 1 0 4 2 4 4 2 3 1 1 1 3 1 0 4 3 3 3 3 4 2 3 0 0 0 2 4
 2 3 4 0 1 3 2 0 3 4 1 3 0 4 1 2 4 1 0 4 2 3 3 3 1 4 1 0 3 4 4 2 3 2 3 4 3
 1 2 1 0 3 3 3 4 2 3 4 1 2 1 2 1 1 0 0 3 3 1 0 3 2 2 0 0 2 0 2 4 1 4 1 4 0
 3 4 3 2 4 1 3 1 1 0 0 4 2 2 3 4 1 0 1 4 4 2 4 1 4 0 3 1 0 2 1 3 4 0 2 3 2
 2 0 1 1 3 4 4 4 1 3 4 3 1 1 3 1 2 2 1 3 3 1 4 0 1 3 0 3 2 4 1 4 4 4 2 3 0
 0 1 4 0 2 4 1 2 3 2 4 2 2]
True labels: [4 1 4 1 3 2 4 4 3 3 1 4 4 0 0 3 0 2 1 2 1 1 1 0 1 1 3 2 2 1 3 4 2 4 0 2 4
 1 4 1 1 2 0 1 4 0 2 4 3 2 1 1 3 3 0 1 1 1 3 3 2 2 1 0 0 1 0 2 4 2 2 0 1 4
 0 3 3 3 4 3 3 0 3 0 0 4 2 1 2 3 4 3 4 4 4 1 4 3 2 2 0 4 1 4 4 2 4 2 3 1 1
 1 3 3 1 1 1 1 3 4 0 1 0 4 2 4 4 2 3 3 1 1 3 1 3 4 3 3 3 3 4 2 3 0 0 0 2 4
 2 3 4 0 1 3 2 0 3 4 1 3 0 4 1 2 4 1 0 4 2 3 3 3 1 4 1 0 3 4 4 2 3 2 3 4 3
 1 2 1 0 3 3 3 4 2 0 4 1 2 1 2 1 1 0 0 3 3 1 0 3 2 2 0 0 2 0 2 4 1 4 1 4 0
 3 4 3 2 4 1 3 1 1 0 0 4 2 2 3 4 1 0 1 4 4 2 4 1 4 1 3 1 0 2 1 3 4 0 2 3 2
 2 1 1 1 0 4 4 4 1 3 4 3 1 1 3 1 2 2 1 3 3 1 4 0 1 3 0 3 2 4 1 4 4 4 2 3 0
 0 1 4 0 2 4 1 2 3 2 4 2 2]
In [ ]:
# Calculate and print classification metrics
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Make predictions on the test set using the trained model
y_pred_prob = model.predict(X_test_reshaped)
y_pred = np.argmax(y_pred_prob, axis=1)  # Get the predicted class labels

# Decode the true labels
y_test_decoded = label_encoder.inverse_transform(y_test_encoded)

# Generate the confusion matrix
cm = confusion_matrix(y_test_decoded, label_encoder.inverse_transform(y_pred))

# Plot the confusion matrix
plt.figure(figsize=(10, 7))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=label_encoder.classes_)
disp.plot(cmap=plt.cm.Blues, ax=plt.gca())
plt.title('Confusion Matrix')
plt.show()
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step 

Classification Report for Test Set

In [ ]:
from sklearn.metrics import classification_report

# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test_reshaped, y_test_categorical)
print("Test Loss:", loss)
print("Test Accuracy:", accuracy)

# Make predictions on the test set
y_pred_prob = model.predict(X_test_reshaped)
y_pred = np.argmax(y_pred_prob, axis=1)  # Get the predicted class labels

# Decode the predicted labels back to original
y_pred_decoded = label_encoder.inverse_transform(y_pred)

# Decode true labels back to original
y_test_decoded = label_encoder.inverse_transform(y_test_encoded)

# Print classification report
print(classification_report(y_test_decoded, y_pred_decoded))
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9719 - loss: 0.0920 
Test Loss: 0.09718480706214905
Test Accuracy: 0.9676375389099121
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step 
              precision    recall  f1-score   support

           0       0.91      0.89      0.90        47
           1       0.99      0.97      0.98        73
           2       0.98      1.00      0.99        56
           3       0.94      0.95      0.95        66
           4       1.00      1.00      1.00        67

    accuracy                           0.97       309
   macro avg       0.96      0.96      0.96       309
weighted avg       0.97      0.97      0.97       309

Observations:

1. Overall Performance:

The model achieves a high accuracy of 96.76% on the test set, indicating excellent performance overall. The test loss of 0.097 is quite low, suggesting that the model is well-optimized without significant overfitting.

2. Class-wise Metrics:

  • Class 0: The recall is slightly lower than precision, meaning some instances of Class 0 are being misclassified. However, the performance is still good.
  • Class 1: High precision and recall indicate the model performs exceptionally well on Class 1.
  • Class 2: Perfect recall indicates all instances of Class 2 are correctly identified.
  • Class 3: Balanced and strong metrics show consistent performance for Class 3.
  • Class 4: Precision, Recall, and F1-Score are all 1.00, showing perfect performance on this class.

3. Macro and Weighted Averages:

  • Macro Avg: Indicates that the model performs consistently across all classes without bias toward any specific class.
  • Weighted Avg: Reflects the overall effectiveness of the model, accounting for class imbalance.

4. Support Distribution:

Class sizes (support) range from 47 to 73, showing slight class imbalance, which the model handles well.

Insights:

1. Strong Model Performance:

The high accuracy (96.76%) and weighted average metrics confirm that the Base LSTM Classifier is robust and generalizes well to unseen data.

2. Perfect Classification for Class 4:

Class 4 achieves perfect scores (precision, recall, and F1), indicating it is the easiest class for the model to classify.

3. Slight Misclassification for Class 0:

The F1-score of 0.90 for Class 0 suggests minor misclassification. This could be due to overlapping feature representations with other classes.

4. Balanced Performance:

The macro and weighted averages are closely aligned, which indicates the model maintains consistent performance across all classes, even with slight class imbalance.

5. Recall as a Focus Area:

Improving recall for Class 0 (currently 0.89) and Class 1 (currently 0.97) could enhance the model's ability to capture all relevant instances in these categories.

Classification Report for Training Set

In [ ]:
# Make predictions on the training set
y_pred_prob_train = model.predict(X_train_reshaped)
y_pred_train = np.argmax(y_pred_prob_train, axis=1)  # Get the predicted class labels for train set

# Decode the predicted labels back to original for train
y_pred_decoded_train = label_encoder.inverse_transform(y_pred_train)
y_train_decoded = label_encoder.inverse_transform(y_train_encoded)

# Classification report for train set
print("\nClassification Report for Training Set:")
print(classification_report(y_train_decoded, y_pred_decoded_train))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step

Classification Report for Training Set:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00       262
           1       1.00      1.00      1.00       236
           2       1.00      1.00      1.00       253
           3       1.00      1.00      1.00       243
           4       1.00      1.00      1.00       242

    accuracy                           1.00      1236
   macro avg       1.00      1.00      1.00      1236
weighted avg       1.00      1.00      1.00      1236

Train vs Validation plots for Accuracy and Loss for Base LSTM Classifier

In [ ]:
import matplotlib.pyplot as plt

# Plot training & validation accuracy values
plt.figure(figsize=(12, 5))

# Accuracy Plot
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Training vs Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.grid()

# Loss Plot
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Training vs Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.grid()

# Show the plots
plt.tight_layout()
plt.show()

Hypertuned LSTM Classifier

In [ ]:
import pandas as pd
import numpy as np
!pip install keras-tuner
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping
from kerastuner import HyperModel, RandomSearch
from kerastuner.engine.hyperparameters import HyperParameters

# Features and target
X = df.drop('Accident Level', axis=1).values
y = df['Accident Level'].values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Encode the target variable
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)
y_train_categorical = to_categorical(y_train_encoded)
y_test_categorical = to_categorical(y_test_encoded)

# Reshape the data for LSTM input (samples, time steps, features)
# Assuming a single time step for simplicity
X_train_reshaped = X_train_scaled.reshape(X_train_scaled.shape[0], 1, X_train_scaled.shape[1])
X_test_reshaped = X_test_scaled.reshape(X_test_scaled.shape[0], 1, X_test_scaled.shape[1])

# Define a HyperModel for LSTM
class LSTMHyperModel(HyperModel):
    def build(self, hp):
        model = Sequential()
        model.add(LSTM(units=hp.Int('units_1', min_value=32, max_value=128, step=32),
                       input_shape=(X_train_reshaped.shape[1], X_train_reshaped.shape[2]),
                       return_sequences=True))
        model.add(Dropout(hp.Float('dropout_1', 0.1, 0.5, step=0.1)))
        model.add(LSTM(units=hp.Int('units_2', min_value=16, max_value=64, step=16)))
        model.add(Dropout(hp.Float('dropout_2', 0.1, 0.5, step=0.1)))
        model.add(Dense(y_train_categorical.shape[1], activation='softmax'))

        model.compile(optimizer=Adam(hp.Float('learning_rate', 1e-4, 1e-2, sampling='LOG')),
                      loss='categorical_crossentropy',
                      metrics=['accuracy'])
        return model

# Initialize the HyperModel
hypermodel = LSTMHyperModel()

# Set up the RandomSearch
tuner = RandomSearch(
    hypermodel,
    objective='val_accuracy',
    max_trials=10,
    executions_per_trial=1,
    directory='my_dir',
    project_name='lstm_hyperparam_tuning'
)

# Set up EarlyStopping
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Perform hyperparameter tuning
tuner.search(X_train_reshaped, y_train_categorical,
             epochs=50,
             batch_size=32,
             validation_data=(X_test_reshaped, y_test_categorical),
             callbacks=[early_stopping])

# Get the best hyperparameters
best_hyperparameters = tuner.get_best_hyperparameters(num_trials=1)[0]
print("Best Hyperparameters:")
print(f"Units Layer 1: {best_hyperparameters.get('units_1')}")
print(f"Dropout Layer 1: {best_hyperparameters.get('dropout_1')}")
print(f"Units Layer 2: {best_hyperparameters.get('units_2')}")
print(f"Dropout Layer 2: {best_hyperparameters.get('dropout_2')}")
print(f"Learning Rate: {best_hyperparameters.get('learning_rate')}")

# Build the model with the best hyperparameters
best_model = tuner.hypermodel.build(best_hyperparameters)

# Train the best model
history = best_model.fit(X_train_reshaped, y_train_categorical,
                          epochs=50,
                          batch_size=32,
                          validation_data=(X_test_reshaped, y_test_categorical),
                          callbacks=[early_stopping])

# Evaluate the model on the test set
loss, accuracy = best_model.evaluate(X_test_reshaped, y_test_categorical)
print("Test Loss:", loss)
print("Test Accuracy:", accuracy)

# Make predictions on the test set
y_pred_prob = best_model.predict(X_test_reshaped)
y_pred = np.argmax(y_pred_prob, axis=1)  # Get the predicted class labels

# Decode the predicted labels back to original
y_pred_decoded = label_encoder.inverse_transform(y_pred)

# Print some predictions
print("Predicted labels:", y_pred_decoded)
print("True labels:", label_encoder.inverse_transform(y_test_encoded))

# Plot training & validation accuracy values
import matplotlib.pyplot as plt

# Plot training & validation accuracy values
plt.figure(figsize=(12, 5))

# Accuracy Plot
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Training vs Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.grid()

# Loss Plot
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Training vs Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.grid()

# Show the plots
plt.tight_layout()
plt.show()
Trial 10 Complete [00h 00m 15s]
val_accuracy: 0.9773463010787964

Best val_accuracy So Far: 0.983818769454956
Total elapsed time: 00h 02m 24s
Best Hyperparameters:
Units Layer 1: 96
Dropout Layer 1: 0.30000000000000004
Units Layer 2: 16
Dropout Layer 2: 0.4
Learning Rate: 0.003798131205565907
Epoch 1/50
39/39 ━━━━━━━━━━━━━━━━━━━━ 4s 27ms/step - accuracy: 0.5353 - loss: 1.4013 - val_accuracy: 0.8964 - val_loss: 0.7057
Epoch 2/50
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - accuracy: 0.9181 - loss: 0.5692 - val_accuracy: 0.9450 - val_loss: 0.2112
Epoch 3/50
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - accuracy: 0.9900 - loss: 0.1745 - val_accuracy: 0.9644 - val_loss: 0.1340
Epoch 4/50
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 17ms/step - accuracy: 0.9887 - loss: 0.1095 - val_accuracy: 0.9547 - val_loss: 0.1546
Epoch 5/50
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - accuracy: 0.9875 - loss: 0.0793 - val_accuracy: 0.9644 - val_loss: 0.1513
Epoch 6/50
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 29ms/step - accuracy: 0.9894 - loss: 0.0665 - val_accuracy: 0.9612 - val_loss: 0.1556
Epoch 7/50
39/39 ━━━━━━━━━━━━━━━━━━━━ 2s 35ms/step - accuracy: 0.9828 - loss: 0.0693 - val_accuracy: 0.9644 - val_loss: 0.1527
Epoch 8/50
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 24ms/step - accuracy: 0.9903 - loss: 0.0473 - val_accuracy: 0.9644 - val_loss: 0.1628
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.9696 - loss: 0.1262 
Test Loss: 0.13402891159057617
Test Accuracy: 0.9644013047218323
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step
Predicted labels: [4 1 4 1 3 2 4 4 3 3 1 4 4 0 0 3 0 2 1 2 1 1 1 0 1 1 3 2 2 1 3 4 2 4 0 2 4
 1 4 1 1 2 0 1 4 0 2 4 3 2 1 1 3 3 0 1 1 1 3 3 2 2 1 0 0 1 0 2 4 2 0 0 1 4
 0 0 3 3 4 3 3 0 3 0 0 4 2 1 2 3 4 3 4 4 4 1 4 3 2 2 3 4 1 4 4 2 4 2 3 1 1
 1 3 3 1 1 1 1 3 4 0 1 0 4 2 4 4 2 3 1 1 1 1 1 0 4 3 3 3 3 4 0 3 0 0 0 2 4
 2 3 4 0 1 3 2 0 3 4 1 2 4 4 1 2 4 1 0 4 2 3 3 3 1 4 1 0 3 4 4 2 3 2 3 4 3
 1 2 1 0 3 3 3 4 2 3 4 1 2 1 2 1 1 0 0 3 3 1 0 3 2 2 0 0 2 0 2 4 1 4 1 4 0
 3 4 3 2 4 1 3 1 1 0 0 4 2 2 3 4 1 0 1 4 4 2 4 1 4 1 3 1 0 2 1 3 4 0 2 3 2
 2 1 1 1 0 4 4 4 1 3 4 3 1 1 3 1 2 2 1 0 3 1 4 0 1 3 0 3 2 4 1 4 4 4 2 3 0
 0 1 4 0 2 4 1 2 3 2 4 2 2]
True labels: [4 1 4 1 3 2 4 4 3 3 1 4 4 0 0 3 0 2 1 2 1 1 1 0 1 1 3 2 2 1 3 4 2 4 0 2 4
 1 4 1 1 2 0 1 4 0 2 4 3 2 1 1 3 3 0 1 1 1 3 3 2 2 1 0 0 1 0 2 4 2 2 0 1 4
 0 3 3 3 4 3 3 0 3 0 0 4 2 1 2 3 4 3 4 4 4 1 4 3 2 2 0 4 1 4 4 2 4 2 3 1 1
 1 3 3 1 1 1 1 3 4 0 1 0 4 2 4 4 2 3 3 1 1 3 1 3 4 3 3 3 3 4 2 3 0 0 0 2 4
 2 3 4 0 1 3 2 0 3 4 1 3 0 4 1 2 4 1 0 4 2 3 3 3 1 4 1 0 3 4 4 2 3 2 3 4 3
 1 2 1 0 3 3 3 4 2 0 4 1 2 1 2 1 1 0 0 3 3 1 0 3 2 2 0 0 2 0 2 4 1 4 1 4 0
 3 4 3 2 4 1 3 1 1 0 0 4 2 2 3 4 1 0 1 4 4 2 4 1 4 1 3 1 0 2 1 3 4 0 2 3 2
 2 1 1 1 0 4 4 4 1 3 4 3 1 1 3 1 2 2 1 3 3 1 4 0 1 3 0 3 2 4 1 4 4 4 2 3 0
 0 1 4 0 2 4 1 2 3 2 4 2 2]
In [ ]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Generate the confusion matrix
cm = confusion_matrix(y_test_encoded, y_pred)

# Plot the confusion matrix
plt.figure(figsize=(10, 7))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=label_encoder.classes_)
disp.plot(cmap=plt.cm.Greens, ax=plt.gca())
plt.title('Confusion Matrix')
plt.show()

Classification Report for Test Set for Hypertuned LSTM Classifier

In [ ]:
from sklearn.metrics import classification_report

# Make predictions on the test set
y_pred_prob = best_model.predict(X_test_reshaped)
y_pred = np.argmax(y_pred_prob, axis=1)  # Get the predicted class labels

# Decode the predicted labels back to original
y_pred_decoded = label_encoder.inverse_transform(y_pred)

# True labels (already encoded)
y_test_decoded = label_encoder.inverse_transform(y_test_encoded)

# Print classification report
print("Classification Report:")
print(classification_report(y_test_decoded, y_pred_decoded))
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step
Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.94      0.92        47
           1       0.97      1.00      0.99        73
           2       0.98      0.96      0.97        56
           3       0.97      0.91      0.94        66
           4       0.99      1.00      0.99        67

    accuracy                           0.96       309
   macro avg       0.96      0.96      0.96       309
weighted avg       0.96      0.96      0.96       309

Classification Report for Training Set for Hypertuned LSTM Classifier

In [ ]:
# Make predictions on the training set
y_train_pred_prob = best_model.predict(X_train_reshaped)
y_train_pred = np.argmax(y_train_pred_prob, axis=1)  # Get the predicted class labels

# Decode the predicted labels back to original
y_train_pred_decoded = label_encoder.inverse_transform(y_train_pred)

# True labels for the training set
y_train_decoded = label_encoder.inverse_transform(y_train_encoded)

# Print classification report for the training set
print("Train Classification Report:")
print(classification_report(y_train_decoded, y_train_pred_decoded))

# Print confusion matrix for training set
print("Train Confusion Matrix:")
print(confusion_matrix(y_train_decoded, y_train_pred_decoded))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
Train Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       262
           1       0.97      1.00      0.98       236
           2       1.00      0.99      1.00       253
           3       1.00      0.96      0.98       243
           4       1.00      1.00      1.00       242

    accuracy                           0.99      1236
   macro avg       0.99      0.99      0.99      1236
weighted avg       0.99      0.99      0.99      1236

Train Confusion Matrix:
[[262   0   0   0   0]
 [  0 236   0   0   0]
 [  2   0 251   0   0]
 [  1   8   0 234   0]
 [  0   0   0   0 242]]

Choose the best performing classifier and pickle it.

Conclusion:¶¶

  • The LSTM Hypertuned model performed exceptionally well, with nearly 100% accuracy for training and 97% for validation datasets.

  • Train vs. Test recall across all the classes remains same except for class 0

  • The consistent behavior of the training and validation accuracy/loss shows that the model is well-tuned and generalizes well to unseen data.

  • The early stopping criteria seem to have helped in stopping the training process at an optimal point, avoiding overfitting while achieving high accuracy.

In [ ]:
import pickle

# Save the model to a file
filename = 'LSTM_Hypertuned_Model.sav'
pickle.dump(best_model, open(filename, 'wb'))

Comparative Analysis (Final model vs. Milestone 1 results)¶

  • Accuracy and Recall: The LSTM model shows a slight dip in validation accuracy and recall compared to the XG Boost test metrics. However, the LSTM model’s performance is still very high.

  • Generalization: Both approaches seem to generalize well, though the LSTM’s slightly lower validation recall for class 0 might suggest a bit more focus on handling that specific class.

  • Complexity and Interpretability: LSTM models, being deep learning-based, generally come with increased complexity and reduced interpretability compared to tree-based methods like Gradient Boosting and XG Boost.

Improvement Analysis

  • Benchmark Improvement: While the LSTM did not surpass the Gradient Boosting or XG Boost models in all metrics, it demonstrated comparable performance with potential benefits in handling sequential or time-series data more effectively (if that's relevant to your application).

  • Tuning and Early Stopping: The LSTM benefits from hypertuning and early stopping, which seem to have optimized its training process to prevent overfitting effectively.

Conclusion

The final solution with the hypertuned LSTM model highlights the strength of deep learning in achieving high accuracy and maintaining strong generalization, particularly in tasks involving sequential data.

While it may not have outperformed models like Gradient Boosting or XGBoost in some metrics, the LSTM model offers a powerful and reliable alternative, especially when handling data with inherent sequence dependencies, as seen in this industrial safety context.

LSTM Sequential Embedding layer¶

In [ ]:
df_LSTM = pd.read_csv('/content/drive/MyDrive/AIML_Capstone_Project/df_preprocess_14122024.csv')
In [ ]:
df_LSTM.head()
Out[ ]:
Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Day Weekday WeekofYear Weekend Season Description tokenized_words
0 Country_01 Local_01 Mining 1 4 Male Contractor Pressed 1 Friday 53 0 Summer remove drill rod jumbo maintenance supervisor ... ['remove', 'drill', 'rod', 'jumbo', 'maintenan...
1 Country_02 Local_02 Mining 1 4 Male Employee Pressurized Systems 2 Saturday 53 1 Summer activation sodium sulphide pump piping uncoupl... ['activation', 'sodium', 'sulphide', 'pump', '...
2 Country_01 Local_03 Mining 1 3 Male Contractor (Remote) Manual Tools 6 Wednesday 1 0 Summer sub station milpo locate level collaborator ex... ['sub', 'station', 'milpo', 'locate', 'level',...
3 Country_01 Local_04 Mining 1 1 Male Contractor Others 8 Friday 1 0 Summer approximately nv personnel begin task unlock s... ['approximately', 'nv', 'personnel', 'begin', ...
4 Country_01 Local_04 Mining 4 4 Male Contractor Others 10 Sunday 1 1 Summer approximately circumstance mechanic anthony gr... ['approximately', 'circumstance', 'mechanic', ...

Glove Embedding Arcitecutre for LSTM

In [ ]:
def generate_glove_sequential_embeddings(df_LSTM):
    df_sequential = df_LSTM.copy()

    # Load GloVe model
    def load_glove_model(glove_file):
        embedding_dict = {}
        with open(glove_file, 'r', encoding="utf8") as f:
            for line in f:
                values = line.split()
                word = values[0]
                vector = np.asarray(values[1:], "float32")
                embedding_dict[word] = vector
        return embedding_dict

    glove_file = '/content/drive/MyDrive/AIML_Capstone_Project/glove.6B/glove.6B.300d.txt'
    glove_embeddings = load_glove_model(glove_file)

    # Function to get GloVe embeddings for each tokenized word sequence
    def get_glove_embeddings(tokenized_words, embedding_dict, embedding_dim=300):
        return [embedding_dict.get(word, np.zeros(embedding_dim)) for word in tokenized_words]

    # Generate GloVe embeddings as sequential data
    glove_embeddings_series = df_sequential['tokenized_words'].apply(
        lambda words: get_glove_embeddings(words, glove_embeddings)
    )

    # Combine the sequential embeddings into a DataFrame
    Glove_df_sequential = pd.concat(
        [df_sequential.drop(columns=['tokenized_words']),
         pd.DataFrame(glove_embeddings_series, columns=['GloVe_Sequence'])],
        axis=1
    )

    return Glove_df_sequential
In [ ]:
df_LSTM.head()
Out[ ]:
Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Day Weekday WeekofYear Weekend Season Description tokenized_words
0 Country_01 Local_01 Mining 1 4 Male Contractor Pressed 1 Friday 53 0 Summer remove drill rod jumbo maintenance supervisor ... ['remove', 'drill', 'rod', 'jumbo', 'maintenan...
1 Country_02 Local_02 Mining 1 4 Male Employee Pressurized Systems 2 Saturday 53 1 Summer activation sodium sulphide pump piping uncoupl... ['activation', 'sodium', 'sulphide', 'pump', '...
2 Country_01 Local_03 Mining 1 3 Male Contractor (Remote) Manual Tools 6 Wednesday 1 0 Summer sub station milpo locate level collaborator ex... ['sub', 'station', 'milpo', 'locate', 'level',...
3 Country_01 Local_04 Mining 1 1 Male Contractor Others 8 Friday 1 0 Summer approximately nv personnel begin task unlock s... ['approximately', 'nv', 'personnel', 'begin', ...
4 Country_01 Local_04 Mining 4 4 Male Contractor Others 10 Sunday 1 1 Summer approximately circumstance mechanic anthony gr... ['approximately', 'circumstance', 'mechanic', ...
In [ ]:
Glove_df_sequential.head()
Out[ ]:
Country City Industry Sector Accident Level Gender Employee type Critical Risk Weekday WeekofYear Weekend Season GloVe_Sequence
0 Country_01 Local_01 Mining 0 Male Contractor Pressed Friday 53 0 Summer NaN
1 Country_02 Local_02 Mining 0 Male Employee Pressurized Systems Saturday 53 1 Summer NaN
2 Country_01 Local_03 Mining 0 Male Contractor (Remote) Manual Tools Wednesday 1 0 Summer NaN
3 Country_01 Local_04 Mining 0 Male Contractor Others Friday 1 0 Summer NaN
4 Country_01 Local_04 Mining 3 Male Contractor Others Sunday 1 1 Summer NaN

Label encode Accident level and Potential Accident Level in Glove_Sequential Dataframes

In [ ]:
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
#label_encoder = LabelEncoder()

# Encode 'Accident Level' and 'Potential Accident Level' in Glove_df
#Glove_df_sequential['Accident Level'] = label_encoder.fit_transform(Glove_df_sequential['Accident Level'])
#Glove_df_sequential['Potential Accident Level'] = label_encoder.fit_transform(Glove_df_sequential['Potential Accident Level'])
In [ ]:
# Columns to drop
#columns_to_drop = ['Day', 'Potential Accident Level']

# Drop columns from each DataFrame
#Glove_df_sequential = Glove_df_sequential.drop(columns_to_drop, axis=1)
In [ ]:
# Calculate target variable distribution for each DataFrame
glove_target_dist = Glove_df_sequential['Accident Level'].value_counts(normalize=False)
# Create a DataFrame to display the distributions
target_distribution_df = pd.DataFrame({
    'Glove': glove_target_dist,
   })

# Print the DataFrame
target_distribution_df
Out[ ]:
Glove
Accident Level
0 309
1 40
2 31
3 30
4 8

Observations: Target Variable Distribution:

Across all three embedding methods (GloVe, TF-IDF, Word2Vec), the distribution of the target variable "Accident Level" remains consistent. This indicates that the embedding process itself doesn't significantly alter the representation of the target variable. The majority of instances fall under a specific "Accident Level" (likely the most common type of accident), highlighting the imbalanced nature of the dataset. Implications for Modeling:

The imbalanced target distribution suggests the need for addressing class imbalance during model training. Techniques like oversampling, undersampling, or using weighted loss functions might be necessary to improve model performance on minority classes. Careful evaluation metrics (precision, recall, F1-score) should be used to assess model performance on all classes, not just the majority class.

In [ ]:
!pip install imblearn
Requirement already satisfied: imblearn in /usr/local/lib/python3.10/dist-packages (0.0)
Requirement already satisfied: imbalanced-learn in /usr/local/lib/python3.10/dist-packages (from imblearn) (0.12.4)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.26.4)
Requirement already satisfied: scipy>=1.5.0 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.13.1)
Requirement already satisfied: scikit-learn>=1.0.2 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.5.2)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.4.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (3.5.0)
In [ ]:
# Balance 'Accident Level' using SMOTE. for all the 3 dataframes.
# Converting categorical features to numerical using one-hot encoding

import pandas as pd
from imblearn.over_sampling import SMOTE

# Function to balance data and one-hot encode categorical features
def balance_and_encode(df):
  # Separate features and target variable
  X = df.drop('Accident Level', axis=1)
  y = df['Accident Level']

  # One-hot encode categorical features (if any)
  categorical_features = X.select_dtypes(include=['object']).columns
  if categorical_features.any():
    X_encoded = pd.get_dummies(X, columns=categorical_features, dtype=int, drop_first=True)
  else:
    X_encoded = X

  # One-hot encode 'DayOfWeek'
  #X_encoded = pd.get_dummies(X_encoded, columns=['DayOfWeek'], dtype=int, drop_first=True)

  # Apply SMOTE to balance the dataset
  smote = SMOTE(random_state=42)
  X_resampled, y_resampled = smote.fit_resample(X_encoded, y)

  # Combine balanced features and target
  balanced_df = pd.concat([X_resampled, y_resampled], axis=1)

  return balanced_df

# Apply the function to each DataFrame
Glove_df_Bal = balance_and_encode(Glove_df_sequential)


# Calculate balanced target variable distribution for each DataFrame
glove_balanced_dist = Glove_df_Bal['Accident Level'].value_counts(normalize=False)


# Create a DataFrame to display the balanced distributions
Balanced_Distribution_df = pd.DataFrame({
    'Glove (Balanced)': glove_balanced_dist,
   })

# Print the DataFrame
Balanced_Distribution_df
Out[ ]:
Glove (Balanced)
Accident Level
0 309
3 309
2 309
1 309
4 309
In [ ]:
Glove_df_Bal
Out[ ]:
WeekofYear Weekend Country_Country_02 Country_Country_03 City_Local_02 City_Local_03 City_Local_04 City_Local_05 City_Local_06 City_Local_07 ... Weekday_Monday Weekday_Saturday Weekday_Sunday Weekday_Thursday Weekday_Tuesday Weekday_Wednesday Season_Spring Season_Summer Season_Winter Accident Level
0 53 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 53 1 1 0 1 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
2 1 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 1 0 1 0 0
3 1 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
4 1 1 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 1 0 3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1540 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 4
1541 16 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 4
1542 9 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 4
1543 6 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 4
1544 11 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 4

1545 rows × 62 columns

In [ ]:
# Export to CSV
Glove_df_Bal.to_csv('/content/drive/MyDrive/AIML_Capstone_Project/Final_NLP_Glove_df_Bal.csv', index=False)
In [ ]:
Glove_df_sequential = generate_glove_sequential_embeddings(df_preprocess2)
In [ ]:
df_LSTM_1 = pd.read_csv('/content/drive/MyDrive/AIML_Capstone_Project/Final_NLP_Glove_df_Bal_14122024.csv')
In [ ]:
# Encode labels in column 'Accident Level'.
y_text = LabelEncoder().fit_transform(y_text)
In [ ]:
# Divide our data into testing and training sets:
from sklearn.model_selection import train_test_split
X_text_train, X_text_test, y_text_train, y_text_test = train_test_split(X_text, y_text, test_size = 0.20, random_state = 1, stratify = y_text)

print('X_text_train shape : ({0})'.format(X_text_train.shape[0]))
print('y_text_train shape : ({0},)'.format(y_text_train.shape[0]))
print('X_text_test shape : ({0})'.format(X_text_test.shape[0]))
print('y_text_test shape : ({0},)'.format(y_text_test.shape[0]))
X_text_train shape : (1236)
y_text_train shape : (1236,)
X_text_test shape : (309)
y_text_test shape : (309,)
In [ ]:
from tensorflow.keras.utils import to_categorical

# Convert both the training and test labels into one-hot encoded vectors:
y_text_train = to_categorical(y_text_train, num_classes=5)  # Ensure the number of classes is specified
y_text_test = to_categorical(y_text_test, num_classes=5)  # Ensure the number of classes is specified
In [ ]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Ensure that X_text_train and X_text_test are lists of strings
X_text_train = [str(text) for text in X_text_train]
X_text_test = [str(text) for text in X_text_test]

# Initialize the tokenizer and fit it on the training data
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_text_train)

# Convert the text data into sequences of numeric indexes
X_text_train = tokenizer.texts_to_sequences(X_text_train)
X_text_test = tokenizer.texts_to_sequences(X_text_test)
In [ ]:
# Installing additional Libraries
from tensorflow.keras.layers import Input, Embedding, LSTM, Bidirectional, GlobalMaxPool1D, Dropout, Dense, Concatenate, BatchNormalization
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.regularizers import l2
from tensorflow.keras.constraints import unit_norm
# Keras pre-processing
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
In [ ]:
# Sentences can have different lengths, and therefore the sequences returned by the Tokenizer class also consist of variable lengths.
# We need to pad the our sequences using the max length.
vocab_size = len(tokenizer.word_index) + 1
print("vocab_size:", vocab_size)

maxlen = 100

X_text_train = pad_sequences(X_text_train, padding='post', maxlen=maxlen)
X_text_test = pad_sequences(X_text_test, padding='post', maxlen=maxlen)
vocab_size: 2169
In [ ]:
embedding_size = 300
embeddings_dictionary = dict()

# Load GloVe model and generate GloVe embeddings
glove_file_path = '/content/drive/MyDrive/AIML_Capstone_Project/glove.6B/glove.6B.300d.txt'

# Open the GloVe file
with open(glove_file_path, encoding='utf-8') as glove_file:
    for line in glove_file:
        records = line.split()
        word = records[0]
        vector_dimensions = np.asarray(records[1:], dtype='float32')
        embeddings_dictionary[word] = vector_dimensions

# Create an embedding matrix
embedding_matrix = np.zeros((vocab_size, embedding_size))

for word, index in tokenizer.word_index.items():
    if index < vocab_size:  # Ensure the index does not exceed the vocabulary size
        embedding_vector = embeddings_dictionary.get(word)
        if embedding_vector is not None:
            embedding_matrix[index] = embedding_vector

# Check the number of embeddings loaded
print(f"Number of embeddings loaded: {len(embeddings_dictionary)}")
Number of embeddings loaded: 400000

Building Simple LSTM Neural Network - Embedded

In [ ]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, LSTM, Bidirectional, GlobalMaxPool1D, Dropout, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD
import numpy as np
import random

def reset_random_seeds():
    np.random.seed(7)
    random.seed(7)
    tf.random.set_seed(7)

# Call the reset function
reset_random_seeds()

# Define your model (as provided in your code)
deep_inputs = Input(shape=(maxlen,))
embedding_layer = Embedding(vocab_size, embedding_size, weights=[embedding_matrix], trainable=False)(deep_inputs)

LSTM_Layer_1 = Bidirectional(LSTM(128, return_sequences=True))(embedding_layer)
max_pool_layer_1 = GlobalMaxPool1D()(LSTM_Layer_1)
drop_out_layer_1 = Dropout(0.5)(max_pool_layer_1)
dense_layer_1 = Dense(128, activation='relu')(drop_out_layer_1)
drop_out_layer_2 = Dropout(0.5)(dense_layer_1)
dense_layer_2 = Dense(64, activation='relu')(drop_out_layer_2)
drop_out_layer_3 = Dropout(0.5)(dense_layer_2)

dense_layer_3 = Dense(32, activation='relu')(drop_out_layer_3)
drop_out_layer_4 = Dropout(0.5)(dense_layer_3)

dense_layer_4 = Dense(10, activation='relu')(drop_out_layer_4)
drop_out_layer_5 = Dropout(0.5)(dense_layer_4)

dense_layer_5 = Dense(5, activation='softmax')(drop_out_layer_5)

model = Model(inputs=deep_inputs, outputs=dense_layer_5)

opt = SGD(learning_rate=0.001, momentum=0.9)  # Updated to use 'learning_rate' instead of 'lr'
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['acc'])

Model Summary of LSTM Embedded

In [ ]:
print(model.summary())
Model: "functional"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ input_layer (InputLayer)             │ (None, 100)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ embedding (Embedding)                │ (None, 100, 300)            │         650,700 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ bidirectional (Bidirectional)        │ (None, 100, 256)            │         439,296 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ global_max_pooling1d                 │ (None, 256)                 │               0 │
│ (GlobalMaxPooling1D)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout (Dropout)                    │ (None, 256)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense (Dense)                        │ (None, 128)                 │          32,896 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_1 (Dropout)                  │ (None, 128)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_1 (Dense)                      │ (None, 64)                  │           8,256 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_2 (Dropout)                  │ (None, 64)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_2 (Dense)                      │ (None, 32)                  │           2,080 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_3 (Dropout)                  │ (None, 32)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_3 (Dense)                      │ (None, 10)                  │             330 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_4 (Dropout)                  │ (None, 10)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_4 (Dense)                      │ (None, 5)                   │              55 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 1,133,613 (4.32 MB)
 Trainable params: 482,913 (1.84 MB)
 Non-trainable params: 650,700 (2.48 MB)
None

Plotting the Model Summary - LSTM Embedded

In [ ]:
from keras.utils import plot_model
from tensorflow.keras.utils import to_categorical


plot_model(model, to_file='model_plot1.png', show_shapes=True, show_dtype=True, show_layer_names=True)
Out[ ]:
In [ ]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from tensorflow.keras.callbacks import Callback

class Metrics(Callback):
    def __init__(self, validation_data, target_type='multi_label'):
        super(Metrics, self).__init__()
        self.validation_data = validation_data
        self.target_type = target_type

    def on_epoch_end(self, epoch, logs=None):
        # Extract validation data and labels
        val_data, val_labels, _ = self.validation_data

        # Predict the output using the model
        val_predictions = self.model.predict(val_data)

        # For multi-label classification, threshold predictions at 0.5
        if self.target_type == 'multi_label':
            val_predictions = (val_predictions > 0.5).astype(int)

            # Calculate metrics
            val_accuracy = accuracy_score(val_labels, val_predictions)
            val_f1 = f1_score(val_labels, val_predictions, average='macro')
            val_precision = precision_score(val_labels, val_predictions, average='macro')
            val_recall = recall_score(val_labels, val_predictions, average='macro')
        else:
            val_predictions = val_predictions.argmax(axis=1)
            val_labels = val_labels.argmax(axis=1)

            val_accuracy = accuracy_score(val_labels, val_predictions)
            val_f1 = f1_score(val_labels, val_predictions, average='macro')
            val_precision = precision_score(val_labels, val_predictions, average='macro')
            val_recall = recall_score(val_labels, val_predictions, average='macro')

        # Print the metrics for the validation set
        print(f" - val_accuracy: {val_accuracy:.4f} - val_f1: {val_f1:.4f} - val_precision: {val_precision:.4f} - val_recall: {val_recall:.4f}")
In [ ]:
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
# Use earlystopping
# callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=5, min_delta=0.001)
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=7, min_delta=1E-3)
rlrp = ReduceLROnPlateau(monitor='val_loss', factor=0.0001, patience=5, min_delta=1E-4)

target_type = 'multi_label'
metrics = Metrics(validation_data=(X_text_train, y_text_train, target_type))

# fit the keras model on the dataset
training_history = model.fit(X_text_train, y_text_train, epochs=100, batch_size=8, verbose=1, validation_data=(X_text_test, y_text_test), callbacks=[rlrp, metrics])
Epoch 1/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step
 - val_accuracy: 0.0000 - val_f1: 0.0000 - val_precision: 0.0000 - val_recall: 0.0000
155/155 ━━━━━━━━━━━━━━━━━━━━ 8s 20ms/step - acc: 0.2070 - loss: 1.6795 - val_acc: 0.2006 - val_loss: 1.6095 - learning_rate: 0.0010
Epoch 2/100
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.0000 - val_f1: 0.0000 - val_precision: 0.0000 - val_recall: 0.0000
155/155 ━━━━━━━━━━━━━━━━━━━━ 8s 22ms/step - acc: 0.2013 - loss: 1.6213 - val_acc: 0.2006 - val_loss: 1.6092 - learning_rate: 0.0010
Epoch 3/100
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.0000 - val_f1: 0.0000 - val_precision: 0.0000 - val_recall: 0.0000
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 16ms/step - acc: 0.2281 - loss: 1.6079 - val_acc: 0.2006 - val_loss: 1.6099 - learning_rate: 0.0010
Epoch 4/100
 11/155 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - acc: 0.2851 - loss: 1.5988
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.0000 - val_f1: 0.0000 - val_precision: 0.0000 - val_recall: 0.0000
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.2568 - loss: 1.5908 - val_acc: 0.2006 - val_loss: 1.5929 - learning_rate: 0.0010
Epoch 5/100
  1/155 ━━━━━━━━━━━━━━━━━━━━ 21s 136ms/step - acc: 0.1250 - loss: 1.5667
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.0000 - val_f1: 0.0000 - val_precision: 0.0000 - val_recall: 0.0000
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 14ms/step - acc: 0.2535 - loss: 1.5589 - val_acc: 0.4304 - val_loss: 1.5557 - learning_rate: 0.0010
Epoch 6/100
 11/155 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - acc: 0.2239 - loss: 1.5811
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.0000 - val_f1: 0.0000 - val_precision: 0.0000 - val_recall: 0.0000
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 17ms/step - acc: 0.2531 - loss: 1.5528 - val_acc: 0.4078 - val_loss: 1.4733 - learning_rate: 0.0010
Epoch 7/100
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.0000 - val_f1: 0.0000 - val_precision: 0.0000 - val_recall: 0.0000
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step - acc: 0.3618 - loss: 1.4733 - val_acc: 0.5825 - val_loss: 1.3528 - learning_rate: 0.0010
Epoch 8/100
 10/155 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - acc: 0.4157 - loss: 1.3727
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.1788 - val_f1: 0.1885 - val_precision: 0.2000 - val_recall: 0.1782
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.3911 - loss: 1.3915 - val_acc: 0.5696 - val_loss: 1.2485 - learning_rate: 0.0010
Epoch 9/100
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.1788 - val_f1: 0.1885 - val_precision: 0.2000 - val_recall: 0.1782
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - acc: 0.4089 - loss: 1.3164 - val_acc: 0.5728 - val_loss: 1.1045 - learning_rate: 0.0010
Epoch 10/100
 11/155 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - acc: 0.4458 - loss: 1.3091
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.3600 - val_f1: 0.3787 - val_precision: 0.4000 - val_recall: 0.3596
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 17ms/step - acc: 0.4389 - loss: 1.2594 - val_acc: 0.5793 - val_loss: 1.0125 - learning_rate: 0.0010
Epoch 11/100
  9/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.5209 - loss: 1.2786
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.3600 - val_f1: 0.3787 - val_precision: 0.4000 - val_recall: 0.3596
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 24ms/step - acc: 0.4739 - loss: 1.1876 - val_acc: 0.5793 - val_loss: 0.9588 - learning_rate: 0.0010
Epoch 12/100
 11/155 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - acc: 0.4344 - loss: 1.1856
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.3600 - val_f1: 0.3787 - val_precision: 0.4000 - val_recall: 0.3596
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.4860 - loss: 1.1401 - val_acc: 0.6181 - val_loss: 0.8820 - learning_rate: 0.0010
Epoch 13/100
 11/155 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - acc: 0.4818 - loss: 1.3453
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.3600 - val_f1: 0.3787 - val_precision: 0.4000 - val_recall: 0.3596
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.5108 - loss: 1.1296 - val_acc: 0.7249 - val_loss: 0.8044 - learning_rate: 0.0010
Epoch 14/100
  1/155 ━━━━━━━━━━━━━━━━━━━━ 22s 146ms/step - acc: 0.3750 - loss: 1.1519
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.3600 - val_f1: 0.3787 - val_precision: 0.4000 - val_recall: 0.3596
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - acc: 0.5290 - loss: 1.0259 - val_acc: 0.7476 - val_loss: 0.7629 - learning_rate: 0.0010
Epoch 15/100
  6/155 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - acc: 0.5806 - loss: 0.9913  
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.3600 - val_f1: 0.3787 - val_precision: 0.4000 - val_recall: 0.3596
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.5646 - loss: 0.9683 - val_acc: 0.7573 - val_loss: 0.7134 - learning_rate: 0.0010
Epoch 16/100
  1/155 ━━━━━━━━━━━━━━━━━━━━ 25s 165ms/step - acc: 0.7500 - loss: 0.9643
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.3600 - val_f1: 0.3787 - val_precision: 0.4000 - val_recall: 0.3596
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - acc: 0.5836 - loss: 0.9986 - val_acc: 0.7476 - val_loss: 0.6901 - learning_rate: 0.0010
Epoch 17/100
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.3600 - val_f1: 0.3787 - val_precision: 0.4000 - val_recall: 0.3596
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 16ms/step - acc: 0.6000 - loss: 0.9376 - val_acc: 0.7638 - val_loss: 0.6553 - learning_rate: 0.0010
Epoch 18/100
 10/155 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - acc: 0.6200 - loss: 0.9570
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.3600 - val_f1: 0.3787 - val_precision: 0.4000 - val_recall: 0.3596
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.6118 - loss: 0.8895 - val_acc: 0.7638 - val_loss: 0.6269 - learning_rate: 0.0010
Epoch 19/100
  1/155 ━━━━━━━━━━━━━━━━━━━━ 22s 147ms/step - acc: 0.3750 - loss: 1.0225
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.3600 - val_f1: 0.3787 - val_precision: 0.4000 - val_recall: 0.3596
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.6193 - loss: 0.8596 - val_acc: 0.7638 - val_loss: 0.6006 - learning_rate: 0.0010
Epoch 20/100
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.3600 - val_f1: 0.3787 - val_precision: 0.4000 - val_recall: 0.3596
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 18ms/step - acc: 0.6253 - loss: 0.8630 - val_acc: 0.7476 - val_loss: 0.5767 - learning_rate: 0.0010
Epoch 21/100
  6/155 ━━━━━━━━━━━━━━━━━━━━ 3s 22ms/step - acc: 0.6021 - loss: 0.8335
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.5340 - val_f1: 0.5649 - val_precision: 0.6000 - val_recall: 0.5337
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - acc: 0.6594 - loss: 0.7966 - val_acc: 0.7476 - val_loss: 0.5603 - learning_rate: 0.0010
Epoch 22/100
 11/155 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - acc: 0.6140 - loss: 0.8716
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.5348 - val_f1: 0.5665 - val_precision: 0.6667 - val_recall: 0.5345
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.6564 - loss: 0.7980 - val_acc: 0.7443 - val_loss: 0.5435 - learning_rate: 0.0010
Epoch 23/100
  1/155 ━━━━━━━━━━━━━━━━━━━━ 27s 177ms/step - acc: 0.2500 - loss: 1.1656
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.5574 - val_f1: 0.5193 - val_precision: 0.5504 - val_recall: 0.5572
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 15ms/step - acc: 0.6589 - loss: 0.7921 - val_acc: 0.7638 - val_loss: 0.5417 - learning_rate: 0.0010
Epoch 24/100
  1/155 ━━━━━━━━━━━━━━━━━━━━ 25s 164ms/step - acc: 0.7500 - loss: 0.7520
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.5364 - val_f1: 0.5054 - val_precision: 0.5294 - val_recall: 0.5361
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 15ms/step - acc: 0.6696 - loss: 0.7842 - val_acc: 0.7476 - val_loss: 0.5270 - learning_rate: 0.0010
Epoch 25/100
  1/155 ━━━━━━━━━━━━━━━━━━━━ 23s 154ms/step - acc: 0.3750 - loss: 0.9241
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.5906 - val_f1: 0.6435 - val_precision: 0.7284 - val_recall: 0.5904
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - acc: 0.6549 - loss: 0.7941 - val_acc: 0.7476 - val_loss: 0.5181 - learning_rate: 0.0010
Epoch 26/100
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9134 - val_f1: 0.9316 - val_precision: 0.9560 - val_recall: 0.9134
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step - acc: 0.6688 - loss: 0.7685 - val_acc: 0.9417 - val_loss: 0.5146 - learning_rate: 0.0010
Epoch 27/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9296 - val_f1: 0.9349 - val_precision: 0.9511 - val_recall: 0.9296
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.6983 - loss: 0.7407 - val_acc: 0.9417 - val_loss: 0.4999 - learning_rate: 0.0010
Epoch 28/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9296 - val_f1: 0.9349 - val_precision: 0.9511 - val_recall: 0.9296
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.7347 - loss: 0.7044 - val_acc: 0.9417 - val_loss: 0.4866 - learning_rate: 0.0010
Epoch 29/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.7338 - val_f1: 0.7364 - val_precision: 0.7502 - val_recall: 0.7337
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 15ms/step - acc: 0.7306 - loss: 0.6852 - val_acc: 0.9417 - val_loss: 0.4595 - learning_rate: 0.0010
Epoch 30/100
  7/155 ━━━━━━━━━━━━━━━━━━━━ 2s 20ms/step - acc: 0.6334 - loss: 0.7949
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.9296 - val_f1: 0.9349 - val_precision: 0.9511 - val_recall: 0.9296
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - acc: 0.7274 - loss: 0.7006 - val_acc: 0.9417 - val_loss: 0.4387 - learning_rate: 0.0010
Epoch 31/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9296 - val_f1: 0.9346 - val_precision: 0.9506 - val_recall: 0.9296
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 16ms/step - acc: 0.7254 - loss: 0.7146 - val_acc: 0.9417 - val_loss: 0.4410 - learning_rate: 0.0010
Epoch 32/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9328 - val_f1: 0.9383 - val_precision: 0.9533 - val_recall: 0.9329
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 14ms/step - acc: 0.7440 - loss: 0.7109 - val_acc: 0.9417 - val_loss: 0.4133 - learning_rate: 0.0010
Epoch 33/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9296 - val_f1: 0.9355 - val_precision: 0.9520 - val_recall: 0.9296
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.7596 - loss: 0.6827 - val_acc: 0.9417 - val_loss: 0.3966 - learning_rate: 0.0010
Epoch 34/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.9328 - val_f1: 0.9379 - val_precision: 0.9529 - val_recall: 0.9329
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 17ms/step - acc: 0.7753 - loss: 0.6654 - val_acc: 0.9385 - val_loss: 0.3857 - learning_rate: 0.0010
Epoch 35/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9304 - val_f1: 0.9357 - val_precision: 0.9515 - val_recall: 0.9305
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 22ms/step - acc: 0.7714 - loss: 0.6644 - val_acc: 0.9417 - val_loss: 0.3791 - learning_rate: 0.0010
Epoch 36/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9337 - val_f1: 0.9390 - val_precision: 0.9539 - val_recall: 0.9337
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 16ms/step - acc: 0.8086 - loss: 0.5983 - val_acc: 0.9417 - val_loss: 0.3646 - learning_rate: 0.0010
Epoch 37/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9328 - val_f1: 0.9385 - val_precision: 0.9539 - val_recall: 0.9329
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 15ms/step - acc: 0.7543 - loss: 0.6555 - val_acc: 0.9417 - val_loss: 0.3555 - learning_rate: 0.0010
Epoch 38/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9337 - val_f1: 0.9402 - val_precision: 0.9558 - val_recall: 0.9337
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 19ms/step - acc: 0.7756 - loss: 0.6447 - val_acc: 0.9417 - val_loss: 0.3588 - learning_rate: 0.0010
Epoch 39/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9312 - val_f1: 0.9371 - val_precision: 0.9529 - val_recall: 0.9313
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - acc: 0.8080 - loss: 0.5977 - val_acc: 0.9417 - val_loss: 0.3535 - learning_rate: 0.0010
Epoch 40/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9328 - val_f1: 0.9385 - val_precision: 0.9539 - val_recall: 0.9329
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8044 - loss: 0.5738 - val_acc: 0.9417 - val_loss: 0.3435 - learning_rate: 0.0010
Epoch 41/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9345 - val_f1: 0.9400 - val_precision: 0.9549 - val_recall: 0.9345
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.7988 - loss: 0.6019 - val_acc: 0.9417 - val_loss: 0.3335 - learning_rate: 0.0010
Epoch 42/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9337 - val_f1: 0.9415 - val_precision: 0.9577 - val_recall: 0.9337
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 15ms/step - acc: 0.7755 - loss: 0.6275 - val_acc: 0.9417 - val_loss: 0.3230 - learning_rate: 0.0010
Epoch 43/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9345 - val_f1: 0.9400 - val_precision: 0.9549 - val_recall: 0.9345
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 17ms/step - acc: 0.8002 - loss: 0.5927 - val_acc: 0.9417 - val_loss: 0.3132 - learning_rate: 0.0010
Epoch 44/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9345 - val_f1: 0.9414 - val_precision: 0.9561 - val_recall: 0.9345
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 15ms/step - acc: 0.7905 - loss: 0.6054 - val_acc: 0.9417 - val_loss: 0.3145 - learning_rate: 0.0010
Epoch 45/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9345 - val_f1: 0.9399 - val_precision: 0.9545 - val_recall: 0.9345
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8230 - loss: 0.5502 - val_acc: 0.9417 - val_loss: 0.3118 - learning_rate: 0.0010
Epoch 46/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9345 - val_f1: 0.9400 - val_precision: 0.9549 - val_recall: 0.9345
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8217 - loss: 0.5571 - val_acc: 0.9417 - val_loss: 0.3066 - learning_rate: 0.0010
Epoch 47/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.9320 - val_f1: 0.9365 - val_precision: 0.9507 - val_recall: 0.9321
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 18ms/step - acc: 0.8172 - loss: 0.5678 - val_acc: 0.9417 - val_loss: 0.3017 - learning_rate: 0.0010
Epoch 48/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9345 - val_f1: 0.9405 - val_precision: 0.9552 - val_recall: 0.9345
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step - acc: 0.7972 - loss: 0.5867 - val_acc: 0.9417 - val_loss: 0.3001 - learning_rate: 0.0010
Epoch 49/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9361 - val_f1: 0.9420 - val_precision: 0.9555 - val_recall: 0.9361
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8233 - loss: 0.5351 - val_acc: 0.9417 - val_loss: 0.2948 - learning_rate: 0.0010
Epoch 50/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9345 - val_f1: 0.9400 - val_precision: 0.9549 - val_recall: 0.9345
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8294 - loss: 0.5471 - val_acc: 0.9417 - val_loss: 0.2968 - learning_rate: 0.0010
Epoch 51/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9345 - val_f1: 0.9398 - val_precision: 0.9542 - val_recall: 0.9345
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8399 - loss: 0.5378 - val_acc: 0.9417 - val_loss: 0.2931 - learning_rate: 0.0010
Epoch 52/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9353 - val_f1: 0.9417 - val_precision: 0.9561 - val_recall: 0.9353
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - acc: 0.8283 - loss: 0.5330 - val_acc: 0.9417 - val_loss: 0.2923 - learning_rate: 0.0010
Epoch 53/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9345 - val_f1: 0.9399 - val_precision: 0.9538 - val_recall: 0.9345
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step - acc: 0.8252 - loss: 0.5254 - val_acc: 0.9417 - val_loss: 0.2876 - learning_rate: 0.0010
Epoch 54/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9377 - val_f1: 0.9422 - val_precision: 0.9543 - val_recall: 0.9377
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step - acc: 0.8086 - loss: 0.5689 - val_acc: 0.9417 - val_loss: 0.2857 - learning_rate: 0.0010
Epoch 55/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9361 - val_f1: 0.9440 - val_precision: 0.9596 - val_recall: 0.9361
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 19ms/step - acc: 0.8139 - loss: 0.5778 - val_acc: 0.9417 - val_loss: 0.2878 - learning_rate: 0.0010
Epoch 56/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9345 - val_f1: 0.9375 - val_precision: 0.9497 - val_recall: 0.9345
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - acc: 0.8279 - loss: 0.5278 - val_acc: 0.9417 - val_loss: 0.2845 - learning_rate: 0.0010
Epoch 57/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9353 - val_f1: 0.9428 - val_precision: 0.9577 - val_recall: 0.9353
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8269 - loss: 0.5232 - val_acc: 0.9417 - val_loss: 0.2828 - learning_rate: 0.0010
Epoch 58/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9385 - val_f1: 0.9434 - val_precision: 0.9552 - val_recall: 0.9385
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8234 - loss: 0.5356 - val_acc: 0.9417 - val_loss: 0.2793 - learning_rate: 0.0010
Epoch 59/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9481 - val_precision: 0.9593 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 15ms/step - acc: 0.8252 - loss: 0.5100 - val_acc: 0.9417 - val_loss: 0.2745 - learning_rate: 0.0010
Epoch 60/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9393 - val_f1: 0.9455 - val_precision: 0.9586 - val_recall: 0.9393
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - acc: 0.8305 - loss: 0.5108 - val_acc: 0.9417 - val_loss: 0.2774 - learning_rate: 0.0010
Epoch 61/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.9409 - val_f1: 0.9478 - val_precision: 0.9608 - val_recall: 0.9410
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - acc: 0.8051 - loss: 0.5319 - val_acc: 0.9417 - val_loss: 0.2702 - learning_rate: 0.0010
Epoch 62/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9377 - val_f1: 0.9434 - val_precision: 0.9564 - val_recall: 0.9377
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 17ms/step - acc: 0.8409 - loss: 0.4757 - val_acc: 0.9417 - val_loss: 0.2719 - learning_rate: 0.0010
Epoch 63/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9385 - val_f1: 0.9445 - val_precision: 0.9575 - val_recall: 0.9385
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 17ms/step - acc: 0.8487 - loss: 0.4800 - val_acc: 0.9417 - val_loss: 0.2731 - learning_rate: 0.0010
Epoch 64/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.9393 - val_f1: 0.9465 - val_precision: 0.9603 - val_recall: 0.9393
155/155 ━━━━━━━━━━━━━━━━━━━━ 6s 23ms/step - acc: 0.8356 - loss: 0.5097 - val_acc: 0.9417 - val_loss: 0.2685 - learning_rate: 0.0010
Epoch 65/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9409 - val_f1: 0.9444 - val_precision: 0.9549 - val_recall: 0.9409
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - acc: 0.8482 - loss: 0.5235 - val_acc: 0.9417 - val_loss: 0.2672 - learning_rate: 0.0010
Epoch 66/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9385 - val_f1: 0.9441 - val_precision: 0.9569 - val_recall: 0.9385
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8199 - loss: 0.5218 - val_acc: 0.9417 - val_loss: 0.2658 - learning_rate: 0.0010
Epoch 67/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9434 - val_f1: 0.9482 - val_precision: 0.9596 - val_recall: 0.9434
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8320 - loss: 0.5044 - val_acc: 0.9417 - val_loss: 0.2647 - learning_rate: 0.0010
Epoch 68/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9450 - val_f1: 0.9493 - val_precision: 0.9596 - val_recall: 0.9450
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 17ms/step - acc: 0.8425 - loss: 0.4887 - val_acc: 0.9417 - val_loss: 0.2701 - learning_rate: 0.0010
Epoch 69/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step
 - val_accuracy: 0.9442 - val_f1: 0.9497 - val_precision: 0.9608 - val_recall: 0.9442
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 26ms/step - acc: 0.8032 - loss: 0.5639 - val_acc: 0.9417 - val_loss: 0.2700 - learning_rate: 0.0010
Epoch 70/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9466 - val_f1: 0.9508 - val_precision: 0.9614 - val_recall: 0.9466
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - acc: 0.8319 - loss: 0.5289 - val_acc: 0.9417 - val_loss: 0.2704 - learning_rate: 0.0010
Epoch 71/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9515 - val_precision: 0.9597 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8493 - loss: 0.4627 - val_acc: 0.9385 - val_loss: 0.2761 - learning_rate: 0.0010
Epoch 72/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9514 - val_precision: 0.9597 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8364 - loss: 0.5082 - val_acc: 0.9417 - val_loss: 0.2656 - learning_rate: 0.0010
Epoch 73/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - acc: 0.8392 - loss: 0.4964 - val_acc: 0.9417 - val_loss: 0.2652 - learning_rate: 1.0000e-07
Epoch 74/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 17ms/step - acc: 0.8474 - loss: 0.4579 - val_acc: 0.9417 - val_loss: 0.2652 - learning_rate: 1.0000e-07
Epoch 75/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step - acc: 0.8237 - loss: 0.5031 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-07
Epoch 76/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8237 - loss: 0.4964 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-07
Epoch 77/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8374 - loss: 0.5190 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-07
Epoch 78/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - acc: 0.8369 - loss: 0.5185 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-11
Epoch 79/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 22ms/step - acc: 0.8461 - loss: 0.4305 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-11
Epoch 80/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - acc: 0.8315 - loss: 0.4903 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-11
Epoch 81/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8272 - loss: 0.5096 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-11
Epoch 82/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - acc: 0.8428 - loss: 0.4700 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-11
Epoch 83/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - acc: 0.8194 - loss: 0.5270 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-15
Epoch 84/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - acc: 0.8517 - loss: 0.4489 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-15
Epoch 85/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step - acc: 0.8408 - loss: 0.4793 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-15
Epoch 86/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 17ms/step - acc: 0.8539 - loss: 0.4527 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-15
Epoch 87/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 6s 20ms/step - acc: 0.8355 - loss: 0.4998 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-15
Epoch 88/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8457 - loss: 0.4812 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-19
Epoch 89/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8280 - loss: 0.4927 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-19
Epoch 90/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8548 - loss: 0.4805 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-19
Epoch 91/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 17ms/step - acc: 0.8361 - loss: 0.4944 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-19
Epoch 92/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 22ms/step - acc: 0.8350 - loss: 0.4661 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-19
Epoch 93/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8487 - loss: 0.4766 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-23
Epoch 94/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8496 - loss: 0.4951 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-23
Epoch 95/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - acc: 0.8357 - loss: 0.5144 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-23
Epoch 96/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - acc: 0.8024 - loss: 0.5325 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-23
Epoch 97/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - acc: 0.8493 - loss: 0.4608 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-23
Epoch 98/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 17ms/step - acc: 0.8401 - loss: 0.4727 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-27
Epoch 99/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step - acc: 0.8282 - loss: 0.4883 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-27
Epoch 100/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9482 - val_f1: 0.9513 - val_precision: 0.9595 - val_recall: 0.9482
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8215 - loss: 0.4982 - val_acc: 0.9417 - val_loss: 0.2653 - learning_rate: 1.0000e-27

Evaluating of Model Accuarcy - LSTM Embedded

In [ ]:
# evaluate the keras model
_, train_accuracy = model.evaluate(X_text_train, y_text_train, batch_size=8, verbose=0)
_, test_accuracy = model.evaluate(X_text_test, y_text_test, batch_size=8, verbose=0)

print('Train accuracy: %.2f' % (train_accuracy*100))
print('Test accuracy: %.2f' % (test_accuracy*100))
Train accuracy: 94.98
Test accuracy: 94.17

LSTM Embedded - Train Vs Test Accuarcy

In [ ]:
import matplotlib.pyplot as plt

# Data for the graph
categories = ['Train Accuracy', 'Test Accuracy']
values = [train_accuracy * 100, test_accuracy * 100]  # Convert to percentages

# Plotting the graph
plt.figure(figsize=(6, 4))
plt.bar(categories, values, color=['blue', 'orange'])
plt.ylim(0, 100)  # Accuracy is represented in percentage
plt.title('Model Accuracy: Train vs Test', fontsize=14)
plt.ylabel('Accuracy (%)', fontsize=12)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

# Annotating the bars with accuracy values
for i, value in enumerate(values):
    plt.text(i, value + 2, f"{value:.2f}%", ha='center', fontsize=10)

# Display the graph
plt.show()
In [ ]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Predict the labels for the test data
y_pred = model.predict(X_text_test)

# If using multi-class classification, the predictions might be probabilities, so we need to convert them to class labels
y_pred_classes = y_pred.argmax(axis=-1)
y_true_classes = y_text_test.argmax(axis=-1)

# Compute metrics
accuracy = accuracy_score(y_true_classes, y_pred_classes)
precision = precision_score(y_true_classes, y_pred_classes, average='weighted')  # Use 'macro', 'micro', or 'weighted' for multi-class
recall = recall_score(y_true_classes, y_pred_classes, average='weighted')
f1 = f1_score(y_true_classes, y_pred_classes, average='weighted')

# Print the metrics
print('Accuracy: %f' % accuracy)
print('Precision: %f' % precision)
print('Recall: %f' % recall)
print('F1 score: %f' % f1)
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step
Accuracy: 0.941748
Precision: 0.954854
Recall: 0.941748
F1 score: 0.944093
In [ ]:
import matplotlib.pyplot as plt

# Metric values
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
values = [accuracy, precision, recall, f1]

# Plotting the metrics
plt.figure(figsize=(8, 5))
plt.bar(metrics, values, color=['blue', 'orange', 'green', 'red'])
plt.ylim(0, 1)  # Metrics are usually in the range [0, 1]
plt.title('Model Performance Metrics-LSTM Embedded', fontsize=14)
plt.ylabel('Score', fontsize=12)
plt.xlabel('Metrics', fontsize=12)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

# Annotating the values on top of the bars
for i, value in enumerate(values):
    plt.text(i, value + 0.02, f"{value:.2f}", ha='center', fontsize=10)

# Display the graph
plt.show()
In [ ]:
epochs = range(len(training_history.history['loss'])) # Get number of epochs

# plot loss learning curves
plt.plot(epochs, training_history.history['loss'], label = 'train')
plt.plot(epochs, training_history.history['val_loss'], label = 'test')
plt.legend(loc = 'upper right')
plt.title ('Training and validation loss')
Out[ ]:
Text(0.5, 1.0, 'Training and validation loss')

Observations

  • Above one is good fit, it is identified by a training and validation loss that decreases to a point of stability with a minimal gap between the two final loss values.
  • The loss of the model will almost always be lower on the training dataset than the validation dataset.
In [ ]:
# plot accuracy learning curves
plt.plot(epochs, training_history.history['acc'], label = 'train')
plt.plot(epochs, training_history.history['val_acc'], label = 'test')
plt.legend(loc = 'upper right')
plt.title ('Training and validation accuracy')
Out[ ]:
Text(0.5, 1.0, 'Training and validation accuracy')

Observations

  • We could see it accuracy continually rise during training.
  • As expected, we see the learning curves for accuracy on the test dataset plateau, indicating that the model has no longer overfit the training dataset and it is generalized model.
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Predict probabilities or class labels for the test set
y_pred_prob = model.predict(X_text_test, batch_size=8)
y_pred = np.argmax(y_pred_prob, axis=1)  # Assuming the output is one-hot encoded

# Convert true labels to integers if needed (for one-hot encoding)
y_true = np.argmax(y_text_test, axis=1)

# Infer unique class labels from the data
unique_classes = np.unique(np.concatenate((y_true, y_pred)))

# Generate confusion matrix
cm = confusion_matrix(y_true, y_pred, labels=unique_classes)

# Plot confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=unique_classes)
disp.plot(cmap=plt.cm.Blues)
plt.title("Confusion Matrix")
plt.show()
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step 

LSTM Embedded with Sequential with other Activation (GELU & SELU)¶

Dense Layers with GELU and SELU:

  • Used activation=gelu and activation=selu where appropriate in the dense layers.
  • SELU is generally used in deeper architectures, and it can help with self-normalization. Here, it is used in the first and fourth dense layers.
  • GELU is often beneficial for smoother activation transitions and is used in the middle dense layers.

Building Simple LSTM Neural Network - Embedded with GELU & SELU

In [ ]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, LSTM, Bidirectional, GlobalMaxPool1D, Dropout, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.activations import gelu, selu
import numpy as np
import random

def reset_random_seeds():
    np.random.seed(7)
    random.seed(7)
    tf.random.set_seed(7)

# Call the reset function
reset_random_seeds()

# Define your model
deep_inputs = Input(shape=(maxlen,))
embedding_layer = Embedding(vocab_size, embedding_size, weights=[embedding_matrix], trainable=False)(deep_inputs)

LSTM_Layer_1 = Bidirectional(LSTM(128, return_sequences=True))(embedding_layer)
max_pool_layer_1 = GlobalMaxPool1D()(LSTM_Layer_1)
drop_out_layer_1 = Dropout(0.5)(max_pool_layer_1)

# First dense layer with SELU activation
dense_layer_1 = Dense(128, activation=selu)(drop_out_layer_1)
drop_out_layer_2 = Dropout(0.5)(dense_layer_1)

# Second dense layer with GELU activation
dense_layer_2 = Dense(64, activation=gelu)(drop_out_layer_2)
drop_out_layer_3 = Dropout(0.5)(dense_layer_2)

# Third dense layer with GELU activation
dense_layer_3 = Dense(32, activation=gelu)(drop_out_layer_3)
drop_out_layer_4 = Dropout(0.5)(dense_layer_3)

# Fourth dense layer with SELU activation
dense_layer_4 = Dense(10, activation=selu)(drop_out_layer_4)
drop_out_layer_5 = Dropout(0.5)(dense_layer_4)

# Output layer with softmax activation
dense_layer_5 = Dense(5, activation='softmax')(drop_out_layer_5)

model = Model(inputs=deep_inputs, outputs=dense_layer_5)

# Compile the model
opt = SGD(learning_rate=0.001, momentum=0.9)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['acc'])

Model Summary of LSTM Embedded With GELU and SELU

In [ ]:
print(model.summary())
Model: "functional_3"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ input_layer_3 (InputLayer)           │ (None, 100)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ embedding_3 (Embedding)              │ (None, 100, 300)            │         650,700 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ bidirectional_3 (Bidirectional)      │ (None, 100, 256)            │         439,296 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ global_max_pooling1d_3               │ (None, 256)                 │               0 │
│ (GlobalMaxPooling1D)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_15 (Dropout)                 │ (None, 256)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_15 (Dense)                     │ (None, 128)                 │          32,896 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_16 (Dropout)                 │ (None, 128)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_16 (Dense)                     │ (None, 64)                  │           8,256 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_17 (Dropout)                 │ (None, 64)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_17 (Dense)                     │ (None, 32)                  │           2,080 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_18 (Dropout)                 │ (None, 32)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_18 (Dense)                     │ (None, 10)                  │             330 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_19 (Dropout)                 │ (None, 10)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_19 (Dense)                     │ (None, 5)                   │              55 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 1,133,613 (4.32 MB)
 Trainable params: 482,913 (1.84 MB)
 Non-trainable params: 650,700 (2.48 MB)
None

Model Summary LSTM Embedded -GELU & SELU

In [ ]:
from keras.utils import plot_model
from tensorflow.keras.utils import to_categorical


plot_model(model, to_file='model_plot1.png', show_shapes=True, show_dtype=True, show_layer_names=True)
Out[ ]:
In [ ]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from tensorflow.keras.callbacks import Callback

class Metrics(Callback):
    def __init__(self, validation_data, target_type='multi_label'):
        super(Metrics, self).__init__()
        self.validation_data = validation_data
        self.target_type = target_type

    def on_epoch_end(self, epoch, logs=None):
        # Extract validation data and labels
        val_data, val_labels, _ = self.validation_data

        # Predict the output using the model
        val_predictions = self.model.predict(val_data)

        # For multi-label classification, threshold predictions at 0.5
        if self.target_type == 'multi_label':
            val_predictions = (val_predictions > 0.5).astype(int)

            # Calculate metrics
            val_accuracy = accuracy_score(val_labels, val_predictions)
            val_f1 = f1_score(val_labels, val_predictions, average='macro')
            val_precision = precision_score(val_labels, val_predictions, average='macro')
            val_recall = recall_score(val_labels, val_predictions, average='macro')
        else:
            val_predictions = val_predictions.argmax(axis=1)
            val_labels = val_labels.argmax(axis=1)

            val_accuracy = accuracy_score(val_labels, val_predictions)
            val_f1 = f1_score(val_labels, val_predictions, average='macro')
            val_precision = precision_score(val_labels, val_predictions, average='macro')
            val_recall = recall_score(val_labels, val_predictions, average='macro')

        # Print the metrics for the validation set
        print(f" - val_accuracy: {val_accuracy:.4f} - val_f1: {val_f1:.4f} - val_precision: {val_precision:.4f} - val_recall: {val_recall:.4f}")
In [ ]:
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
# Use earlystopping
# callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=5, min_delta=0.001)
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=7, min_delta=1E-3)
rlrp = ReduceLROnPlateau(monitor='val_loss', factor=0.0001, patience=5, min_delta=1E-4)

target_type = 'multi_label'
metrics = Metrics(validation_data=(X_text_train, y_text_train, target_type))

# fit the keras model on the dataset
training_history = model.fit(X_text_train, y_text_train, epochs=100, batch_size=8, verbose=1, validation_data=(X_text_test, y_text_test), callbacks=[rlrp, metrics])
Epoch 1/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 15ms/step
 - val_accuracy: 0.0000 - val_f1: 0.0000 - val_precision: 0.0000 - val_recall: 0.0000
155/155 ━━━━━━━━━━━━━━━━━━━━ 10s 32ms/step - acc: 0.2293 - loss: 1.8443 - val_acc: 0.3851 - val_loss: 1.4293 - learning_rate: 0.0010
Epoch 2/100
  6/155 ━━━━━━━━━━━━━━━━━━━━ 3s 25ms/step - acc: 0.1882 - loss: 1.6576
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.1812 - val_f1: 0.1902 - val_precision: 0.2000 - val_recall: 0.1814
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 22ms/step - acc: 0.3038 - loss: 1.4888 - val_acc: 0.9029 - val_loss: 1.0985 - learning_rate: 0.0010
Epoch 3/100
  7/155 ━━━━━━━━━━━━━━━━━━━━ 2s 18ms/step - acc: 0.5550 - loss: 1.3083
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.3600 - val_f1: 0.3787 - val_precision: 0.4000 - val_recall: 0.3596
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - acc: 0.4782 - loss: 1.2795 - val_acc: 0.9353 - val_loss: 0.7320 - learning_rate: 0.0010
Epoch 4/100
  7/155 ━━━━━━━━━━━━━━━━━━━━ 2s 18ms/step - acc: 0.4359 - loss: 1.2895
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.7298 - val_f1: 0.7628 - val_precision: 0.8000 - val_recall: 0.7296
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 27ms/step - acc: 0.5559 - loss: 1.0704 - val_acc: 0.9417 - val_loss: 0.5159 - learning_rate: 0.0010
Epoch 5/100
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.7759 - val_f1: 0.8332 - val_precision: 0.9481 - val_recall: 0.7758
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 23ms/step - acc: 0.6609 - loss: 0.9069 - val_acc: 0.9385 - val_loss: 0.3797 - learning_rate: 0.0010
Epoch 6/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9167 - val_f1: 0.9284 - val_precision: 0.9486 - val_recall: 0.9167
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 17ms/step - acc: 0.7281 - loss: 0.7715 - val_acc: 0.9417 - val_loss: 0.3182 - learning_rate: 0.0010
Epoch 7/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.9288 - val_f1: 0.9328 - val_precision: 0.9482 - val_recall: 0.9288
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 17ms/step - acc: 0.7559 - loss: 0.6967 - val_acc: 0.9417 - val_loss: 0.2804 - learning_rate: 0.0010
Epoch 8/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9296 - val_f1: 0.9332 - val_precision: 0.9483 - val_recall: 0.9296
155/155 ━━━━━━━━━━━━━━━━━━━━ 6s 25ms/step - acc: 0.7949 - loss: 0.6197 - val_acc: 0.9417 - val_loss: 0.2545 - learning_rate: 0.0010
Epoch 9/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9296 - val_f1: 0.9335 - val_precision: 0.9488 - val_recall: 0.9296
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 25ms/step - acc: 0.8201 - loss: 0.5701 - val_acc: 0.9417 - val_loss: 0.2500 - learning_rate: 0.0010
Epoch 10/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9296 - val_f1: 0.9329 - val_precision: 0.9479 - val_recall: 0.9296
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8061 - loss: 0.5803 - val_acc: 0.9417 - val_loss: 0.2398 - learning_rate: 0.0010
Epoch 11/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9296 - val_f1: 0.9329 - val_precision: 0.9479 - val_recall: 0.9296
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8042 - loss: 0.5428 - val_acc: 0.9417 - val_loss: 0.2402 - learning_rate: 0.0010
Epoch 12/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9296 - val_f1: 0.9329 - val_precision: 0.9479 - val_recall: 0.9296
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8050 - loss: 0.5604 - val_acc: 0.9417 - val_loss: 0.2329 - learning_rate: 0.0010
Epoch 13/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9296 - val_f1: 0.9332 - val_precision: 0.9483 - val_recall: 0.9296
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 15ms/step - acc: 0.8434 - loss: 0.4974 - val_acc: 0.9417 - val_loss: 0.2301 - learning_rate: 0.0010
Epoch 14/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9296 - val_f1: 0.9335 - val_precision: 0.9488 - val_recall: 0.9296
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - acc: 0.8472 - loss: 0.4469 - val_acc: 0.9417 - val_loss: 0.2333 - learning_rate: 0.0010
Epoch 15/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9296 - val_f1: 0.9332 - val_precision: 0.9483 - val_recall: 0.9296
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step - acc: 0.8476 - loss: 0.4419 - val_acc: 0.9417 - val_loss: 0.2258 - learning_rate: 0.0010
Epoch 16/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9296 - val_f1: 0.9335 - val_precision: 0.9488 - val_recall: 0.9296
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8601 - loss: 0.4604 - val_acc: 0.9417 - val_loss: 0.2282 - learning_rate: 0.0010
Epoch 17/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9296 - val_f1: 0.9335 - val_precision: 0.9488 - val_recall: 0.9296
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8627 - loss: 0.4180 - val_acc: 0.9417 - val_loss: 0.2248 - learning_rate: 0.0010
Epoch 18/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9296 - val_f1: 0.9341 - val_precision: 0.9497 - val_recall: 0.9296
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 18ms/step - acc: 0.8708 - loss: 0.3849 - val_acc: 0.9417 - val_loss: 0.2242 - learning_rate: 0.0010
Epoch 19/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9320 - val_f1: 0.9366 - val_precision: 0.9515 - val_recall: 0.9321
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 33ms/step - acc: 0.8588 - loss: 0.4334 - val_acc: 0.9417 - val_loss: 0.2263 - learning_rate: 0.0010
Epoch 20/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9337 - val_f1: 0.9374 - val_precision: 0.9515 - val_recall: 0.9337
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8655 - loss: 0.4129 - val_acc: 0.9417 - val_loss: 0.2251 - learning_rate: 0.0010
Epoch 21/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9345 - val_f1: 0.9379 - val_precision: 0.9515 - val_recall: 0.9345
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 23ms/step - acc: 0.8790 - loss: 0.3572 - val_acc: 0.9417 - val_loss: 0.2298 - learning_rate: 0.0010
Epoch 22/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step
 - val_accuracy: 0.9377 - val_f1: 0.9437 - val_precision: 0.9578 - val_recall: 0.9377
155/155 ━━━━━━━━━━━━━━━━━━━━ 9s 47ms/step - acc: 0.8776 - loss: 0.3524 - val_acc: 0.9417 - val_loss: 0.2304 - learning_rate: 0.0010
Epoch 23/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step
 - val_accuracy: 0.9434 - val_f1: 0.9489 - val_precision: 0.9614 - val_recall: 0.9434
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 33ms/step - acc: 0.8795 - loss: 0.3448 - val_acc: 0.9417 - val_loss: 0.2289 - learning_rate: 0.0010
Epoch 24/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 6s 37ms/step - acc: 0.8608 - loss: 0.3997 - val_acc: 0.9417 - val_loss: 0.2280 - learning_rate: 1.0000e-07
Epoch 25/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 7s 16ms/step - acc: 0.8796 - loss: 0.3379 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-07
Epoch 26/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8701 - loss: 0.3372 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-07
Epoch 27/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8855 - loss: 0.3333 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-07
Epoch 28/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - acc: 0.8958 - loss: 0.3138 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-07
Epoch 29/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 15ms/step - acc: 0.8763 - loss: 0.3621 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-11
Epoch 30/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - acc: 0.8473 - loss: 0.3744 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-11
Epoch 31/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 6s 19ms/step - acc: 0.8827 - loss: 0.3441 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-11
Epoch 32/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step - acc: 0.8852 - loss: 0.3361 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-11
Epoch 33/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8851 - loss: 0.3176 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-11
Epoch 34/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 15ms/step - acc: 0.9020 - loss: 0.3241 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-15
Epoch 35/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8690 - loss: 0.3478 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-15
Epoch 36/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 22ms/step - acc: 0.8995 - loss: 0.3171 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-15
Epoch 37/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 16ms/step - acc: 0.8854 - loss: 0.3085 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-15
Epoch 38/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8876 - loss: 0.3421 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-15
Epoch 39/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8676 - loss: 0.3811 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-19
Epoch 40/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 19ms/step - acc: 0.8698 - loss: 0.3872 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-19
Epoch 41/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - acc: 0.8747 - loss: 0.3682 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-19
Epoch 42/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step - acc: 0.8657 - loss: 0.3622 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-19
Epoch 43/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8755 - loss: 0.3224 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-19
Epoch 44/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - acc: 0.8852 - loss: 0.3511 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-23
Epoch 45/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - acc: 0.9079 - loss: 0.3191 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-23
Epoch 46/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 17ms/step - acc: 0.8762 - loss: 0.3292 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-23
Epoch 47/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8821 - loss: 0.3379 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-23
Epoch 48/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8918 - loss: 0.3320 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-23
Epoch 49/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8897 - loss: 0.3096 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-27
Epoch 50/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 18ms/step - acc: 0.8837 - loss: 0.3618 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-27
Epoch 51/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - acc: 0.8761 - loss: 0.3305 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-27
Epoch 52/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8710 - loss: 0.3661 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-27
Epoch 53/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8727 - loss: 0.3650 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-27
Epoch 54/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - acc: 0.8740 - loss: 0.3451 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-31
Epoch 55/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8571 - loss: 0.3933 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-31
Epoch 56/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 21ms/step - acc: 0.8610 - loss: 0.3986 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-31
Epoch 57/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 17ms/step - acc: 0.8845 - loss: 0.3374 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-31
Epoch 58/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step - acc: 0.8848 - loss: 0.3495 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-31
Epoch 59/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 6s 21ms/step - acc: 0.8797 - loss: 0.3413 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-35
Epoch 60/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 18ms/step - acc: 0.8933 - loss: 0.3197 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-35
Epoch 61/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8912 - loss: 0.3465 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-35
Epoch 62/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8747 - loss: 0.3652 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-35
Epoch 63/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8782 - loss: 0.3702 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-35
Epoch 64/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 18ms/step - acc: 0.8689 - loss: 0.3555 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-39
Epoch 65/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step - acc: 0.8686 - loss: 0.3648 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-39
Epoch 66/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - acc: 0.8703 - loss: 0.3919 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-39
Epoch 67/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - acc: 0.8598 - loss: 0.3718 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-39
Epoch 68/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 17ms/step - acc: 0.8898 - loss: 0.3245 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 1.0000e-39
Epoch 69/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 24ms/step - acc: 0.8715 - loss: 0.3582 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 9.9492e-44
Epoch 70/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 16ms/step - acc: 0.8790 - loss: 0.3630 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 9.9492e-44
Epoch 71/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.9067 - loss: 0.3019 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 9.9492e-44
Epoch 72/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - acc: 0.8932 - loss: 0.3191 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 9.9492e-44
Epoch 73/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - acc: 0.8815 - loss: 0.3560 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 9.9492e-44
Epoch 74/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 6s 23ms/step - acc: 0.8965 - loss: 0.3381 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 75/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8490 - loss: 0.3994 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 76/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8725 - loss: 0.3703 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 77/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - acc: 0.8735 - loss: 0.3769 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 78/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 17ms/step - acc: 0.8816 - loss: 0.3532 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 79/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8798 - loss: 0.3455 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 80/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8941 - loss: 0.3507 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 81/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 19ms/step - acc: 0.8746 - loss: 0.3739 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 82/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step - acc: 0.8677 - loss: 0.3728 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 83/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8654 - loss: 0.3497 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 84/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - acc: 0.9061 - loss: 0.3132 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 85/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 18ms/step - acc: 0.8893 - loss: 0.3214 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 86/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 22ms/step - acc: 0.8816 - loss: 0.3381 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 87/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8820 - loss: 0.3454 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 88/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - acc: 0.8767 - loss: 0.3440 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 89/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - acc: 0.8842 - loss: 0.3424 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 90/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - acc: 0.8672 - loss: 0.3585 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 91/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - acc: 0.8769 - loss: 0.3476 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 92/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 17ms/step - acc: 0.8699 - loss: 0.3684 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 93/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step - acc: 0.8849 - loss: 0.3213 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 94/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 18ms/step - acc: 0.8842 - loss: 0.3376 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 95/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - acc: 0.8929 - loss: 0.3380 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 96/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8877 - loss: 0.3609 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 97/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8730 - loss: 0.3576 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 98/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - acc: 0.8754 - loss: 0.3518 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 99/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - acc: 0.8649 - loss: 0.3426 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00
Epoch 100/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
 - val_accuracy: 0.9426 - val_f1: 0.9485 - val_precision: 0.9614 - val_recall: 0.9426
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 23ms/step - acc: 0.8981 - loss: 0.3220 - val_acc: 0.9417 - val_loss: 0.2279 - learning_rate: 0.0000e+00

Evaluating of Model Accuarcy - LSTM Embedded with GELU & SELU

In [ ]:
# evaluate the keras model
_, train_accuracy = model.evaluate(X_text_train, y_text_train, batch_size=8, verbose=0)
_, test_accuracy = model.evaluate(X_text_test, y_text_test, batch_size=8, verbose=0)

print('Train accuracy: %.2f' % (train_accuracy*100))
print('Test accuracy: %.2f' % (test_accuracy*100))
Train accuracy: 94.58
Test accuracy: 94.17

Plotting Model Accuarcy - LSTM Embedded with GELU & SELU

In [ ]:
import matplotlib.pyplot as plt

# Data for the graph
categories = ['Train Accuracy', 'Test Accuracy']
values = [train_accuracy * 100, test_accuracy * 100]  # Convert to percentages

# Plotting the graph
plt.figure(figsize=(6, 4))
plt.bar(categories, values, color=['blue', 'orange'])
plt.ylim(0, 100)  # Accuracy is represented in percentage
plt.title('Model Accuracy: Train vs Test', fontsize=14)
plt.ylabel('Accuracy (%)', fontsize=12)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

# Annotating the bars with accuracy values
for i, value in enumerate(values):
    plt.text(i, value + 2, f"{value:.2f}%", ha='center', fontsize=10)

# Display the graph
plt.show()
In [ ]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Predict the labels for the test data
y_pred = model.predict(X_text_test)

# If using multi-class classification, the predictions might be probabilities, so we need to convert them to class labels
y_pred_classes = y_pred.argmax(axis=-1)
y_true_classes = y_text_test.argmax(axis=-1)

# Compute metrics
accuracy = accuracy_score(y_true_classes, y_pred_classes)
precision = precision_score(y_true_classes, y_pred_classes, average='weighted')  # Use 'macro', 'micro', or 'weighted' for multi-class
recall = recall_score(y_true_classes, y_pred_classes, average='weighted')
f1 = f1_score(y_true_classes, y_pred_classes, average='weighted')

# Print the metrics
print('Accuracy: %f' % accuracy)
print('Precision: %f' % precision)
print('Recall: %f' % recall)
print('F1 score: %f' % f1)
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step
Accuracy: 0.941748
Precision: 0.953303
Recall: 0.941748
F1 score: 0.943735

Model Performance Metrics-LSTM Embedded -GELU & SELU

In [ ]:
import matplotlib.pyplot as plt

# Metric values
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
values = [accuracy, precision, recall, f1]

# Plotting the metrics
plt.figure(figsize=(8, 5))
plt.bar(metrics, values, color=['blue', 'orange', 'green', 'red'])
plt.ylim(0, 1)  # Metrics are usually in the range [0, 1]
plt.title('Model Performance Metrics-LSTM Embedded -GELU & SELU', fontsize=14)
plt.ylabel('Score', fontsize=12)
plt.xlabel('Metrics', fontsize=12)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

# Annotating the values on top of the bars
for i, value in enumerate(values):
    plt.text(i, value + 0.02, f"{value:.2f}", ha='center', fontsize=10)

# Display the graph
plt.show()

Training and validation loss -GELU & SELU

In [ ]:
epochs = range(len(training_history.history['loss'])) # Get number of epochs

# plot loss learning curves
plt.plot(epochs, training_history.history['loss'], label = 'train')
plt.plot(epochs, training_history.history['val_loss'], label = 'test')
plt.legend(loc = 'upper right')
plt.title ('Training and validation loss -GELU & SELU')
Out[ ]:
Text(0.5, 1.0, 'Training and validation loss')

LSTM Embedded Confusion Matrix -GELU & SELU

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Predict probabilities or class labels for the test set
y_pred_prob = model.predict(X_text_test, batch_size=8)
y_pred = np.argmax(y_pred_prob, axis=1)  # Assuming the output is one-hot encoded

# Convert true labels to integers if needed (for one-hot encoding)
y_true = np.argmax(y_text_test, axis=1)

# Infer unique class labels from the data
unique_classes = np.unique(np.concatenate((y_true, y_pred)))

# Generate confusion matrix
cm = confusion_matrix(y_true, y_pred, labels=unique_classes)

# Plot confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=unique_classes)
disp.plot(cmap=plt.cm.Blues)
plt.title("Confusion Matrix  -GELU & SELU")
plt.show()
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 15ms/step

Simple RNN Embedding layer without Averaging (Sequential)¶

When to Use SimpleRNN:

  • SimpleRNN is suitable for small datasets and problems where long-term dependencies are not critical. If your task requires learning long-term dependencies, consider using LSTM or GRU. Comparison:

  • While SimpleRNN is computationally less expensive, it can suffer from the vanishing gradient problem when used with long sequences.

In [ ]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, SimpleRNN, Bidirectional, GlobalMaxPool1D, Dropout, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD
import numpy as np
import random

def reset_random_seeds():
    np.random.seed(7)
    random.seed(7)
    tf.random.set_seed(7)

# Call the reset function
reset_random_seeds()

# Define your RNN model
deep_inputs = Input(shape=(maxlen,))
embedding_layer = Embedding(vocab_size, embedding_size, weights=[embedding_matrix], trainable=False)(deep_inputs)

# Replace the LSTM layer with a SimpleRNN layer
RNN_Layer_1 = Bidirectional(SimpleRNN(128, return_sequences=True))(embedding_layer)
max_pool_layer_1 = GlobalMaxPool1D()(RNN_Layer_1)
drop_out_layer_1 = Dropout(0.5)(max_pool_layer_1)
dense_layer_1 = Dense(128, activation='relu')(drop_out_layer_1)
drop_out_layer_2 = Dropout(0.5)(dense_layer_1)
dense_layer_2 = Dense(64, activation='relu')(drop_out_layer_2)
drop_out_layer_3 = Dropout(0.5)(dense_layer_2)

dense_layer_3 = Dense(32, activation='relu')(drop_out_layer_3)
drop_out_layer_4 = Dropout(0.5)(dense_layer_3)

dense_layer_4 = Dense(10, activation='relu')(drop_out_layer_4)
drop_out_layer_5 = Dropout(0.5)(dense_layer_4)

dense_layer_5 = Dense(5, activation='softmax')(drop_out_layer_5)

model_RNN = Model(inputs=deep_inputs, outputs=dense_layer_5)

# Compile the model
opt = SGD(learning_rate=0.001, momentum=0.9)  # Updated to use 'learning_rate'
model_RNN.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['acc'])

# Print the model summary
model_RNN.summary()
Model: "functional_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ input_layer_4 (InputLayer)           │ (None, 100)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ embedding_4 (Embedding)              │ (None, 100, 300)            │         650,700 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ bidirectional_4 (Bidirectional)      │ (None, 100, 256)            │         109,824 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ global_max_pooling1d_4               │ (None, 256)                 │               0 │
│ (GlobalMaxPooling1D)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_20 (Dropout)                 │ (None, 256)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_20 (Dense)                     │ (None, 128)                 │          32,896 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_21 (Dropout)                 │ (None, 128)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_21 (Dense)                     │ (None, 64)                  │           8,256 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_22 (Dropout)                 │ (None, 64)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_22 (Dense)                     │ (None, 32)                  │           2,080 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_23 (Dropout)                 │ (None, 32)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_23 (Dense)                     │ (None, 10)                  │             330 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_24 (Dropout)                 │ (None, 10)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_24 (Dense)                     │ (None, 5)                   │              55 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 804,141 (3.07 MB)
 Trainable params: 153,441 (599.38 KB)
 Non-trainable params: 650,700 (2.48 MB)

Plotting the Model Summary - Simple RNN Embedded

In [ ]:
print(model_RNN.summary())
Model: "functional_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ input_layer_4 (InputLayer)           │ (None, 100)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ embedding_4 (Embedding)              │ (None, 100, 300)            │         650,700 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ bidirectional_4 (Bidirectional)      │ (None, 100, 256)            │         109,824 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ global_max_pooling1d_4               │ (None, 256)                 │               0 │
│ (GlobalMaxPooling1D)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_20 (Dropout)                 │ (None, 256)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_20 (Dense)                     │ (None, 128)                 │          32,896 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_21 (Dropout)                 │ (None, 128)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_21 (Dense)                     │ (None, 64)                  │           8,256 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_22 (Dropout)                 │ (None, 64)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_22 (Dense)                     │ (None, 32)                  │           2,080 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_23 (Dropout)                 │ (None, 32)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_23 (Dense)                     │ (None, 10)                  │             330 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_24 (Dropout)                 │ (None, 10)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_24 (Dense)                     │ (None, 5)                   │              55 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 804,141 (3.07 MB)
 Trainable params: 153,441 (599.38 KB)
 Non-trainable params: 650,700 (2.48 MB)
None

Plotting the Model Summary - Simple RNN

In [ ]:
from keras.utils import plot_model
from tensorflow.keras.utils import to_categorical


plot_model(model_RNN, to_file='model_plot1.png', show_shapes=True, show_dtype=True, show_layer_names=True)
Out[ ]:
In [ ]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from tensorflow.keras.callbacks import Callback

class Metrics(Callback):
    def __init__(self, validation_data, target_type='multi_label'):
        super(Metrics, self).__init__()
        self.validation_data = validation_data
        self.target_type = target_type

    def on_epoch_end(self, epoch, logs=None):
        # Extract validation data and labels
        val_data, val_labels, _ = self.validation_data

        # Predict the output using the model
        val_predictions = self.model_RNN.predict(val_data)

        # For multi-label classification, threshold predictions at 0.5
        if self.target_type == 'multi_label':
            val_predictions = (val_predictions > 0.5).astype(int)

            # Calculate metrics
            val_accuracy = accuracy_score(val_labels, val_predictions)
            val_f1 = f1_score(val_labels, val_predictions, average='macro')
            val_precision = precision_score(val_labels, val_predictions, average='macro')
            val_recall = recall_score(val_labels, val_predictions, average='macro')
        else:
            val_predictions = val_predictions.argmax(axis=1)
            val_labels = val_labels.argmax(axis=1)

            val_accuracy = accuracy_score(val_labels, val_predictions)
            val_f1 = f1_score(val_labels, val_predictions, average='macro')
            val_precision = precision_score(val_labels, val_predictions, average='macro')
            val_recall = recall_score(val_labels, val_predictions, average='macro')

        # Print the metrics for the validation set
        print(f" - val_accuracy: {val_accuracy:.4f} - val_f1: {val_f1:.4f} - val_precision: {val_precision:.4f} - val_recall: {val_recall:.4f}")
In [ ]:
import tensorflow as tf

class Metrics(tf.keras.callbacks.Callback):
    def __init__(self, validation_data, target_type):
        super(Metrics, self).__init__()
        self.validation_data = validation_data
        self.target_type = target_type

    def on_epoch_end(self, epoch, logs=None):
        # Extract validation data
        X_val, y_val = self.validation_data

        # Use self.model to access the current model
        y_pred = self.model.predict(X_val)

        # Implement your custom metrics logic here
        if self.target_type == 'multi_label':
            # Example for multi-label case
            print(f"Epoch {epoch + 1}: Custom metrics can be computed here.")

        # Optionally, add results to logs for tracking
        logs = logs or {}
        logs['custom_metric'] = 0.95  # Replace with real computation
In [ ]:
# Create the Metrics callback
metrics = Metrics(validation_data=(X_text_train, y_text_train), target_type=target_type)

# Fit the RNN model
training_history = model_RNN.fit(
    X_text_train,
    y_text_train,
    epochs=100,
    batch_size=8,
    verbose=1,
    validation_data=(X_text_test, y_text_test),
    callbacks=[rlrp, callback, metrics]  # Include the fixed Metrics callback
)
Epoch 1/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 2s 28ms/step
Epoch 1: Custom metrics can be computed here.
155/155 ━━━━━━━━━━━━━━━━━━━━ 18s 77ms/step - acc: 0.2011 - loss: 2.0915 - val_acc: 0.2006 - val_loss: 1.6102 - learning_rate: 0.0010 - custom_metric: 0.9500
Epoch 2/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step
Epoch 2: Custom metrics can be computed here.
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 33ms/step - acc: 0.2059 - loss: 1.6527 - val_acc: 0.2006 - val_loss: 1.6090 - learning_rate: 0.0010 - custom_metric: 0.9500
Epoch 3/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step
Epoch 3: Custom metrics can be computed here.
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 28ms/step - acc: 0.1934 - loss: 1.6326 - val_acc: 0.2006 - val_loss: 1.6091 - learning_rate: 0.0010 - custom_metric: 0.9500
Epoch 4/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step
Epoch 4: Custom metrics can be computed here.
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 34ms/step - acc: 0.2028 - loss: 1.6154 - val_acc: 0.2006 - val_loss: 1.6105 - learning_rate: 0.0010 - custom_metric: 0.9500
Epoch 5/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step
Epoch 5: Custom metrics can be computed here.
155/155 ━━━━━━━━━━━━━━━━━━━━ 9s 29ms/step - acc: 0.1948 - loss: 1.6197 - val_acc: 0.2006 - val_loss: 1.6096 - learning_rate: 0.0010 - custom_metric: 0.9500
Epoch 6/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step
Epoch 6: Custom metrics can be computed here.
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 29ms/step - acc: 0.2084 - loss: 1.6134 - val_acc: 0.2006 - val_loss: 1.6094 - learning_rate: 0.0010 - custom_metric: 0.9500
Epoch 7/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step
Epoch 7: Custom metrics can be computed here.
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 34ms/step - acc: 0.1915 - loss: 1.6162 - val_acc: 0.2006 - val_loss: 1.6098 - learning_rate: 0.0010 - custom_metric: 0.9500
Epoch 8/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step
Epoch 8: Custom metrics can be computed here.
155/155 ━━━━━━━━━━━━━━━━━━━━ 10s 31ms/step - acc: 0.1906 - loss: 1.6179 - val_acc: 0.2006 - val_loss: 1.6100 - learning_rate: 1.0000e-07 - custom_metric: 0.9500
Epoch 9/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step
Epoch 9: Custom metrics can be computed here.
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 32ms/step - acc: 0.2023 - loss: 1.6090 - val_acc: 0.2006 - val_loss: 1.6100 - learning_rate: 1.0000e-07 - custom_metric: 0.9500
Epoch 10/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step
Epoch 10: Custom metrics can be computed here.
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 28ms/step - acc: 0.2179 - loss: 1.6102 - val_acc: 0.2006 - val_loss: 1.6100 - learning_rate: 1.0000e-07 - custom_metric: 0.9500
Epoch 11/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step
Epoch 11: Custom metrics can be computed here.
155/155 ━━━━━━━━━━━━━━━━━━━━ 6s 31ms/step - acc: 0.1910 - loss: 1.6140 - val_acc: 0.2006 - val_loss: 1.6100 - learning_rate: 1.0000e-07 - custom_metric: 0.9500
Epoch 12/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step
Epoch 12: Custom metrics can be computed here.
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 32ms/step - acc: 0.2229 - loss: 1.6130 - val_acc: 0.2006 - val_loss: 1.6100 - learning_rate: 1.0000e-07 - custom_metric: 0.9500
Epoch 13/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step
Epoch 13: Custom metrics can be computed here.
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 28ms/step - acc: 0.2242 - loss: 1.6118 - val_acc: 0.2006 - val_loss: 1.6100 - learning_rate: 1.0000e-11 - custom_metric: 0.9500
Epoch 14/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step
Epoch 14: Custom metrics can be computed here.
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 29ms/step - acc: 0.2021 - loss: 1.6124 - val_acc: 0.2006 - val_loss: 1.6100 - learning_rate: 1.0000e-11 - custom_metric: 0.9500
Epoch 15/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step
Epoch 15: Custom metrics can be computed here.
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 30ms/step - acc: 0.2026 - loss: 1.6079 - val_acc: 0.2006 - val_loss: 1.6100 - learning_rate: 1.0000e-11 - custom_metric: 0.9500
Epoch 16/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step
Epoch 16: Custom metrics can be computed here.
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 28ms/step - acc: 0.1972 - loss: 1.6186 - val_acc: 0.2006 - val_loss: 1.6100 - learning_rate: 1.0000e-11 - custom_metric: 0.9500
Epoch 17/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step
Epoch 17: Custom metrics can be computed here.
155/155 ━━━━━━━━━━━━━━━━━━━━ 6s 33ms/step - acc: 0.2084 - loss: 1.6138 - val_acc: 0.2006 - val_loss: 1.6100 - learning_rate: 1.0000e-11 - custom_metric: 0.9500
Epoch 18/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step
Epoch 18: Custom metrics can be computed here.
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 29ms/step - acc: 0.2127 - loss: 1.6119 - val_acc: 0.2006 - val_loss: 1.6100 - learning_rate: 1.0000e-15 - custom_metric: 0.9500
Epoch 19/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step
Epoch 19: Custom metrics can be computed here.
155/155 ━━━━━━━━━━━━━━━━━━━━ 4s 28ms/step - acc: 0.1965 - loss: 1.6118 - val_acc: 0.2006 - val_loss: 1.6100 - learning_rate: 1.0000e-15 - custom_metric: 0.9500
Epoch 20/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step
Epoch 20: Custom metrics can be computed here.
155/155 ━━━━━━━━━━━━━━━━━━━━ 5s 33ms/step - acc: 0.1904 - loss: 1.6179 - val_acc: 0.2006 - val_loss: 1.6100 - learning_rate: 1.0000e-15 - custom_metric: 0.9500

Evaluating of Model Accuarcy - Simple RNN

In [ ]:
# evaluate the keras model
_, train_accuracy = model_RNN.evaluate(X_text_train, y_text_train, batch_size=8, verbose=0)
_, test_accuracy = model_RNN.evaluate(X_text_test, y_text_test, batch_size=8, verbose=0)

print('Train accuracy: %.2f' % (train_accuracy*100))
print('Test accuracy: %.2f' % (test_accuracy*100))
Train accuracy: 19.98
Test accuracy: 20.06
In [ ]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Predict the labels for the test data
y_pred = model_RNN.predict(X_text_test)

# If using multi-class classification, the predictions might be probabilities, so we need to convert them to class labels
y_pred_classes = y_pred.argmax(axis=-1)
y_true_classes = y_text_test.argmax(axis=-1)

# Compute metrics
accuracy = accuracy_score(y_true_classes, y_pred_classes)
precision = precision_score(y_true_classes, y_pred_classes, average='weighted')  # Use 'macro', 'micro', or 'weighted' for multi-class
recall = recall_score(y_true_classes, y_pred_classes, average='weighted')
f1 = f1_score(y_true_classes, y_pred_classes, average='weighted')

# Print the metrics
print('Accuracy: %f' % accuracy)
print('Precision: %f' % precision)
print('Recall: %f' % recall)
print('F1 score: %f' % f1)
3/3 ━━━━━━━━━━━━━━━━━━━━ 1s 611ms/step
Accuracy: 0.738095
Precision: 0.544785
Recall: 0.738095
F1 score: 0.626875
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
In [ ]:
epochs = range(len(training_history.history['loss'])) # Get number of epochs

# plot loss learning curves
plt.plot(epochs, training_history.history['loss'], label = 'train')
plt.plot(epochs, training_history.history['val_loss'], label = 'test')
plt.legend(loc = 'upper right')
plt.title ('Training and validation loss')
Out[ ]:
Text(0.5, 1.0, 'Training and validation loss')

Observation

  • The performance of two LSTM Hypertuned models with different Glove embedding techniques was compared based on their accuracy on training and validation datasets.

LSTM Hypertuned Model with Sequential Glove Embedding:

  • Achieved an accuracy of 74% on the training dataset.
  • Achieved an accuracy of 74% on the validation dataset. While the model demonstrates decent performance, its accuracy is lower compared to the other model in this comparison.

LSTM Hypertuned Model with Average Glove Embedding:

  • This model outperformed the sequential embedding model.
  • Achieved higher accuracy on both the training and validation datasets, making it more suitable for the use case.

Conclusion:

For the given use case, the LSTM Hypertuned Model with Average Glove Embedding is identified as the better choice due to its superior performance in terms of accuracy compared to the LSTM Hypertuned Model with Sequential Glove Embedding.

Exploration of User Interface¶

In [ ]:
! pip install streamlit
!pip install pyngrok
Requirement already satisfied: streamlit in /usr/local/lib/python3.10/dist-packages (1.40.2)
Requirement already satisfied: altair<6,>=4.0 in /usr/local/lib/python3.10/dist-packages (from streamlit) (4.2.2)
Requirement already satisfied: blinker<2,>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from streamlit) (1.9.0)
Requirement already satisfied: cachetools<6,>=4.0 in /usr/local/lib/python3.10/dist-packages (from streamlit) (5.5.0)
Requirement already satisfied: click<9,>=7.0 in /usr/local/lib/python3.10/dist-packages (from streamlit) (8.1.7)
Requirement already satisfied: numpy<3,>=1.23 in /usr/local/lib/python3.10/dist-packages (from streamlit) (1.26.4)
Requirement already satisfied: packaging<25,>=20 in /usr/local/lib/python3.10/dist-packages (from streamlit) (24.2)
Requirement already satisfied: pandas<3,>=1.4.0 in /usr/local/lib/python3.10/dist-packages (from streamlit) (2.2.2)
Requirement already satisfied: pillow<12,>=7.1.0 in /usr/local/lib/python3.10/dist-packages (from streamlit) (11.0.0)
Requirement already satisfied: protobuf<6,>=3.20 in /usr/local/lib/python3.10/dist-packages (from streamlit) (4.25.5)
Requirement already satisfied: pyarrow>=7.0 in /usr/local/lib/python3.10/dist-packages (from streamlit) (17.0.0)
Requirement already satisfied: requests<3,>=2.27 in /usr/local/lib/python3.10/dist-packages (from streamlit) (2.32.3)
Requirement already satisfied: rich<14,>=10.14.0 in /usr/local/lib/python3.10/dist-packages (from streamlit) (13.9.4)
Requirement already satisfied: tenacity<10,>=8.1.0 in /usr/local/lib/python3.10/dist-packages (from streamlit) (9.0.0)
Requirement already satisfied: toml<2,>=0.10.1 in /usr/local/lib/python3.10/dist-packages (from streamlit) (0.10.2)
Requirement already satisfied: typing-extensions<5,>=4.3.0 in /usr/local/lib/python3.10/dist-packages (from streamlit) (4.12.2)
Requirement already satisfied: watchdog<7,>=2.1.5 in /usr/local/lib/python3.10/dist-packages (from streamlit) (6.0.0)
Requirement already satisfied: gitpython!=3.1.19,<4,>=3.0.7 in /usr/local/lib/python3.10/dist-packages (from streamlit) (3.1.43)
Requirement already satisfied: pydeck<1,>=0.8.0b4 in /usr/local/lib/python3.10/dist-packages (from streamlit) (0.9.1)
Requirement already satisfied: tornado<7,>=6.0.3 in /usr/local/lib/python3.10/dist-packages (from streamlit) (6.3.3)
Requirement already satisfied: entrypoints in /usr/local/lib/python3.10/dist-packages (from altair<6,>=4.0->streamlit) (0.4)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from altair<6,>=4.0->streamlit) (3.1.4)
Requirement already satisfied: jsonschema>=3.0 in /usr/local/lib/python3.10/dist-packages (from altair<6,>=4.0->streamlit) (4.23.0)
Requirement already satisfied: toolz in /usr/local/lib/python3.10/dist-packages (from altair<6,>=4.0->streamlit) (0.12.1)
Requirement already satisfied: gitdb<5,>=4.0.1 in /usr/local/lib/python3.10/dist-packages (from gitpython!=3.1.19,<4,>=3.0.7->streamlit) (4.0.11)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas<3,>=1.4.0->streamlit) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas<3,>=1.4.0->streamlit) (2024.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-packages (from pandas<3,>=1.4.0->streamlit) (2024.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.27->streamlit) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.27->streamlit) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.27->streamlit) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.27->streamlit) (2024.8.30)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich<14,>=10.14.0->streamlit) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich<14,>=10.14.0->streamlit) (2.18.0)
Requirement already satisfied: smmap<6,>=3.0.1 in /usr/local/lib/python3.10/dist-packages (from gitdb<5,>=4.0.1->gitpython!=3.1.19,<4,>=3.0.7->streamlit) (5.0.1)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->altair<6,>=4.0->streamlit) (3.0.2)
Requirement already satisfied: attrs>=22.2.0 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=3.0->altair<6,>=4.0->streamlit) (24.2.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=3.0->altair<6,>=4.0->streamlit) (2024.10.1)
Requirement already satisfied: referencing>=0.28.4 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=3.0->altair<6,>=4.0->streamlit) (0.35.1)
Requirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=3.0->altair<6,>=4.0->streamlit) (0.21.0)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich<14,>=10.14.0->streamlit) (0.1.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas<3,>=1.4.0->streamlit) (1.16.0)
Requirement already satisfied: pyngrok in /usr/local/lib/python3.10/dist-packages (7.2.1)
Requirement already satisfied: PyYAML>=5.1 in /usr/local/lib/python3.10/dist-packages (from pyngrok) (6.0.2)
In [ ]:
# Write Streamlit app code
app_code = """
import streamlit as st

# Title and UI elements
st.title("Streamlit App in Google Colab")
st.sidebar.header("User Inputs")

# Input fields
name = st.sidebar.text_input("Enter your name:", "")
age = st.sidebar.number_input("Enter your age:", min_value=1, max_value=100, step=1)

# Display data
if st.sidebar.button("Submit"):
    st.write(f"Hello, {name}!")
    st.write(f"You are {age} years old.")
"""

# Save to a file
with open('app.py', 'w') as f:
    f.write(app_code)

print("Streamlit app saved as app.py")
Streamlit app saved as app.py
In [ ]:
!ngrok config add-authtoken 2pX6HpuEogHS7K69APX2wL1ygMt_6b76BwiMW1oonM6emUvcU
Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml
In [ ]:
!pip install streamlit pyngrok

from pyngrok import ngrok

# Create Streamlit app
app_code = """
import streamlit as st

st.title("Streamlit App in Google Colab")
st.sidebar.header("User Inputs")

# Input fields
name = st.sidebar.text_input("Enter your name:", "")
age = st.sidebar.number_input("Enter your age:", min_value=1, max_value=100, step=1)

# Display data
if st.sidebar.button("Submit"):
    st.write(f"Hello, {name}!")
    st.write(f"You are {age} years old.")
"""

# Save to a file
with open('app.py', 'w') as f:
    f.write(app_code)

# Start Streamlit server
!streamlit run app.py &>/dev/null&
public_url = ngrok.connect(8501)
print(f"Streamlit app is live at {public_url}")
Requirement already satisfied: streamlit in /usr/local/lib/python3.10/dist-packages (1.40.2)
Requirement already satisfied: pyngrok in /usr/local/lib/python3.10/dist-packages (7.2.1)
Requirement already satisfied: altair<6,>=4.0 in /usr/local/lib/python3.10/dist-packages (from streamlit) (4.2.2)
Requirement already satisfied: blinker<2,>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from streamlit) (1.9.0)
Requirement already satisfied: cachetools<6,>=4.0 in /usr/local/lib/python3.10/dist-packages (from streamlit) (5.5.0)
Requirement already satisfied: click<9,>=7.0 in /usr/local/lib/python3.10/dist-packages (from streamlit) (8.1.7)
Requirement already satisfied: numpy<3,>=1.23 in /usr/local/lib/python3.10/dist-packages (from streamlit) (1.26.4)
Requirement already satisfied: packaging<25,>=20 in /usr/local/lib/python3.10/dist-packages (from streamlit) (24.2)
Requirement already satisfied: pandas<3,>=1.4.0 in /usr/local/lib/python3.10/dist-packages (from streamlit) (2.2.2)
Requirement already satisfied: pillow<12,>=7.1.0 in /usr/local/lib/python3.10/dist-packages (from streamlit) (11.0.0)
Requirement already satisfied: protobuf<6,>=3.20 in /usr/local/lib/python3.10/dist-packages (from streamlit) (4.25.5)
Requirement already satisfied: pyarrow>=7.0 in /usr/local/lib/python3.10/dist-packages (from streamlit) (17.0.0)
Requirement already satisfied: requests<3,>=2.27 in /usr/local/lib/python3.10/dist-packages (from streamlit) (2.32.3)
Requirement already satisfied: rich<14,>=10.14.0 in /usr/local/lib/python3.10/dist-packages (from streamlit) (13.9.4)
Requirement already satisfied: tenacity<10,>=8.1.0 in /usr/local/lib/python3.10/dist-packages (from streamlit) (9.0.0)
Requirement already satisfied: toml<2,>=0.10.1 in /usr/local/lib/python3.10/dist-packages (from streamlit) (0.10.2)
Requirement already satisfied: typing-extensions<5,>=4.3.0 in /usr/local/lib/python3.10/dist-packages (from streamlit) (4.12.2)
Requirement already satisfied: watchdog<7,>=2.1.5 in /usr/local/lib/python3.10/dist-packages (from streamlit) (6.0.0)
Requirement already satisfied: gitpython!=3.1.19,<4,>=3.0.7 in /usr/local/lib/python3.10/dist-packages (from streamlit) (3.1.43)
Requirement already satisfied: pydeck<1,>=0.8.0b4 in /usr/local/lib/python3.10/dist-packages (from streamlit) (0.9.1)
Requirement already satisfied: tornado<7,>=6.0.3 in /usr/local/lib/python3.10/dist-packages (from streamlit) (6.3.3)
Requirement already satisfied: PyYAML>=5.1 in /usr/local/lib/python3.10/dist-packages (from pyngrok) (6.0.2)
Requirement already satisfied: entrypoints in /usr/local/lib/python3.10/dist-packages (from altair<6,>=4.0->streamlit) (0.4)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from altair<6,>=4.0->streamlit) (3.1.4)
Requirement already satisfied: jsonschema>=3.0 in /usr/local/lib/python3.10/dist-packages (from altair<6,>=4.0->streamlit) (4.23.0)
Requirement already satisfied: toolz in /usr/local/lib/python3.10/dist-packages (from altair<6,>=4.0->streamlit) (0.12.1)
Requirement already satisfied: gitdb<5,>=4.0.1 in /usr/local/lib/python3.10/dist-packages (from gitpython!=3.1.19,<4,>=3.0.7->streamlit) (4.0.11)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas<3,>=1.4.0->streamlit) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas<3,>=1.4.0->streamlit) (2024.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-packages (from pandas<3,>=1.4.0->streamlit) (2024.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.27->streamlit) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.27->streamlit) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.27->streamlit) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.27->streamlit) (2024.8.30)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich<14,>=10.14.0->streamlit) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich<14,>=10.14.0->streamlit) (2.18.0)
Requirement already satisfied: smmap<6,>=3.0.1 in /usr/local/lib/python3.10/dist-packages (from gitdb<5,>=4.0.1->gitpython!=3.1.19,<4,>=3.0.7->streamlit) (5.0.1)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->altair<6,>=4.0->streamlit) (3.0.2)
Requirement already satisfied: attrs>=22.2.0 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=3.0->altair<6,>=4.0->streamlit) (24.2.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=3.0->altair<6,>=4.0->streamlit) (2024.10.1)
Requirement already satisfied: referencing>=0.28.4 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=3.0->altair<6,>=4.0->streamlit) (0.35.1)
Requirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=3.0->altair<6,>=4.0->streamlit) (0.21.0)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich<14,>=10.14.0->streamlit) (0.1.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas<3,>=1.4.0->streamlit) (1.16.0)
WARNING:pyngrok.process.ngrok:t=2024-11-29T18:00:13+0000 lvl=warn msg="failed to start tunnel" pg=/api/tunnels id=8074cde44355bb5e err="failed to start tunnel: Your account may not run more than 3 tunnels over a single ngrok agent session.\nThe tunnels already running on this session are:\ntn_2pX6fxR7MG7Y7GjCKCTSqUck54G, tn_2pX6pkBYQSrIkk7FgyPcgFHoist, tn_2pXAqIbZujzpefdXoD0hAC5PFnU\n\r\n\r\nERR_NGROK_324\r\n"
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pyngrok/ngrok.py in api_request(url, method, data, params, timeout, auth)
    521     try:
--> 522         response = urlopen(request, encoded_data, timeout)
    523         response_data = response.read().decode("utf-8")

/usr/lib/python3.10/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    215         opener = _opener
--> 216     return opener.open(url, data, timeout)
    217 

/usr/lib/python3.10/urllib/request.py in open(self, fullurl, data, timeout)
    524             meth = getattr(processor, meth_name)
--> 525             response = meth(req, response)
    526 

/usr/lib/python3.10/urllib/request.py in http_response(self, request, response)
    633         if not (200 <= code < 300):
--> 634             response = self.parent.error(
    635                 'http', request, response, code, msg, hdrs)

/usr/lib/python3.10/urllib/request.py in error(self, proto, *args)
    562             args = (dict, 'default', 'http_error_default') + orig_args
--> 563             return self._call_chain(*args)
    564 

/usr/lib/python3.10/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
    495             func = getattr(handler, meth_name)
--> 496             result = func(*args)
    497             if result is not None:

/usr/lib/python3.10/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    642     def http_error_default(self, req, fp, code, msg, hdrs):
--> 643         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    644 

HTTPError: HTTP Error 502: Bad Gateway

During handling of the above exception, another exception occurred:

PyngrokNgrokHTTPError                     Traceback (most recent call last)
<ipython-input-70-ec59c0054ac0> in <cell line: 28>()
     26 # Start Streamlit server
     27 get_ipython().system('streamlit run app.py &>/dev/null&')
---> 28 public_url = ngrok.connect(8501)
     29 print(f"Streamlit app is live at {public_url}")

/usr/local/lib/python3.10/dist-packages/pyngrok/ngrok.py in connect(addr, proto, name, pyngrok_config, **options)
    318     logger.debug(f"Creating tunnel with options: {options}")
    319 
--> 320     tunnel = NgrokTunnel(api_request(f"{api_url}/api/tunnels", method="POST", data=options,
    321                                      timeout=pyngrok_config.request_timeout),
    322                          pyngrok_config, api_url)

/usr/local/lib/python3.10/dist-packages/pyngrok/ngrok.py in api_request(url, method, data, params, timeout, auth)
    541         logger.debug(f"Response {status_code}: {response_data.strip()}")
    542 
--> 543         raise PyngrokNgrokHTTPError(f"ngrok client exception, API returned {status_code}: {response_data}",
    544                                     e.url,
    545                                     status_code, e.reason, e.headers, response_data)

PyngrokNgrokHTTPError: ngrok client exception, API returned 502: {"error_code":103,"status_code":502,"msg":"failed to start tunnel","details":{"err":"failed to start tunnel: Your account may not run more than 3 tunnels over a single ngrok agent session.\nThe tunnels already running on this session are:\ntn_2pX6fxR7MG7Y7GjCKCTSqUck54G, tn_2pX6pkBYQSrIkk7FgyPcgFHoist, tn_2pXAqIbZujzpefdXoD0hAC5PFnU\n\r\n\r\nERR_NGROK_324\r\n"}}